📊 ArXiv Research Report (2026-03-28)

Generated: 2026-03-28 09:22:31 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
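
The threshold arithmetic above can be checked with a short sketch. It assumes that each keyword's score is its weight times its relevance (which matches the score tables in this report) and that the per-keyword maximum of 15 acts as a cap; the function names are illustrative and not taken from the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as stated in the report: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum per-keyword scores; each row is (weight, relevance on a 0-10 scale).

    Assumption: a keyword's score is weight * relevance, capped at max_per_keyword.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


threshold = passing_score(total_weight=27.0)

# Non-zero rows of the first paper's score table: (weight, relevance).
rows = [(1.0, 10.0), (1.0, 8.0), (1.0, 8.0), (1.0, 8.0), (1.0, 10.0)]
score = paper_score(rows)

print(f"threshold={threshold:.1f} score={score:.1f} passed={score >= threshold}")
```

With 27 keywords of weight 1.0 the threshold works out to 5.0 + 0.8 × 27.0 = 26.6, so a paper needs strong relevance on roughly three keywords to pass.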

📈 Paper statistics

Total fetched: 296 papers
Passing papers: 7 (2.4%)
Deep analysis: 7 papers

⭐ Detailed analysis of passing papers

1. Closing the Confidence-Faithfulness Gap in Large Language Models

Authors: Miranda Muqing Miao, Lyle Ungar Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25052v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

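Scoring rationale: The paper's core topic is confidence calibration in LLMs, which falls under innovation in the technical principles of large models. Highly relevant keywords: 1) “Large Language Models” (the core object of study, 10 points); 2) “Mechanistic Interpretability” (uses linear probes and CAA for mechanistic explanation, 10 points). Moderately relevant: 1) “Chain of Thought” (the paper studies how the reasoning process affects confidence, 8 points); 2) “System 2 Thinking” (involves in-depth reasoning, 8 points); 3) “Hallucination Mitigation” (confidence calibration relates to factuality, 8 points). The remaining keywords are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that the confidence LLMs verbalize is detached from their actual accuracy, shows how the reasoning process contaminates verbalized confidence, and proposes an adaptive steering method that substantially improves calibration alignment.

Abstract translation (Chinese)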

大型语言模型（LLM）倾向于表达与其实际准确率显著脱节的置信度分数，然而支配这一行为的几何关系仍鲜为人知。在本研究中，我们对语言化置信度进行了机制可解释性分析，通过线性探针和对比激活添加（CAA）导向技术表明：校准信号与语言化置信度信号虽以线性方式编码，但彼此正交——这一发现在三个开源权重模型和四个数据集中均保持一致。有趣的是，当模型被要求同时进行问题推理并表达置信度分数时，推理过程会干扰语言化置信度的方向，加剧校准失准。我们将此现象称为“推理污染效应”。基于这一发现，我们提出了一种两阶段自适应导向流程：该流程读取模型的内部准确率估计值，并引导语言化输出与之匹配，从而在所有评估模型中显著提升了校准对齐度。

Abstract

Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remain poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another – a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the “Reasoning Contamination Effect.” Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model’s internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.

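Keywords: Large Language Models, Confidence Calibration, Mechanistic Interpretability, Linear Probes, Contrastive Activation Addition, Reasoning Contamination Effect, Adaptive Steering, Verbalized Confidence

Deep analysis:

Closing the Confidence-Faithfulness Gap in Large Language Models

Summary:

This paper presents a mechanistic interpretability analysis of why the confidence LLMs verbalize is detached from their actual accuracy. It finds that although models internally encode well-calibrated accuracy information, the verbalized confidence signal is nearly orthogonal to it. In particular, when a model reasons and verbalizes confidence at the same time, the reasoning process disrupts the confidence direction and worsens calibration (the Reasoning Contamination Effect). Building on this, the authors propose a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers the output to match it, substantially improving calibration alignment.

Contributions:

Shows that the model's internal accuracy signal and its verbalized confidence signal are geometrically orthogonal, which explains why a model can “know” the answer yet “say” the wrong confidence
Identifies the “Reasoning Contamination Effect”: joint prompting (reasoning and scoring at the same time) shifts the confidence and accuracy directions from weak alignment to strong opposition
Proposes a two-stage adaptive steering pipeline that uses contrastive activation addition (CAA) to substantially improve calibration without retraining the model

Method

!!! info

The paper relies mainly on mechanistic interpretability analysis. First, ridge-regression linear probes test whether accuracy information is linearly accessible in the residual stream. Second, contrastive activation addition (CAA) builds steering vectors from the difference in activations between high-confidence and low-confidence prompts. The study designs three prompt types (correctness-only, confidence-only, and joint) to disentangle the model's representations, and validates the findings on Llama, Qwen, and other models using datasets including MATH and MMLU.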
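
The two ingredients named above, a ridge-regression linear probe and a CAA steering vector, can be illustrated on synthetic data. This is a minimal sketch and not the authors' code: the activations are random vectors rather than residual-stream states, the "accuracy" and "confidence" directions are planted along two orthogonal axes by assumption, and all function names are hypothetical.

```python
import numpy as np


def ridge_probe(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def caa_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """CAA steering vector: mean activation difference between two contrastive sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering vector to a hidden state."""
    return hidden + alpha * vector


rng = np.random.default_rng(0)
d = 64
accuracy_dir = np.eye(d)[0]    # planted direction carrying the correctness signal
confidence_dir = np.eye(d)[1]  # planted, orthogonal direction for verbalized confidence

# Synthetic activations whose correctness score depends only on accuracy_dir.
acts = rng.normal(size=(500, d))
correct = acts @ accuracy_dir + 0.1 * rng.normal(size=500)
probe_w = ridge_probe(acts, correct)

# Contrastive sets that differ only along confidence_dir.
high_conf = rng.normal(size=(200, d)) + 2.0 * confidence_dir
low_conf = rng.normal(size=(200, d)) - 2.0 * confidence_dir
steer_v = caa_vector(high_conf, low_conf)

print("cos(probe, steering) =", round(cosine(probe_w, steer_v), 3))
steered = steer(acts[0], steer_v, alpha=0.5)
```

Because the two directions are planted orthogonally, the cosine between the probe weights and the steering vector comes out close to zero, which is the kind of geometric relationship the paper reports for real models.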

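Key results:

The model contains a linearly accessible, well-calibrated accuracy direction, but the verbalized confidence direction is nearly orthogonal to it (cosine similarity < 0.04)
Joint prompting inverts the relationship between the confidence and accuracy directions: cosine similarity drops from +0.26 to -0.63
The steering-based method reduces calibration error by a factor of 4-7, outperforming existing prompt-engineering methods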
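
The report does not say which calibration metric underlies the 4-7x figure. As an assumption, the sketch below uses expected calibration error (ECE), a common choice: predictions are binned by stated confidence, and the gap between mean confidence and observed accuracy is averaged across bins, weighted by bin size.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean of |mean confidence - accuracy| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in [0, n_bins - 1].
    bin_ids = np.minimum(np.digitize(confidences, edges[1:-1]), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


outcomes = [1] * 5 + [0] * 5  # the model is right half of the time
overconfident = expected_calibration_error([0.9] * 10, outcomes)
calibrated = expected_calibration_error([0.5] * 10, outcomes)
print(overconfident, calibrated)
```

A model that always says 0.9 while being right half the time has a large gap; one that says 0.5 has none. Steering verbalized confidence toward the internal accuracy estimate aims to move a model from the first case toward the second.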

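Tech stack: linear probes, contrastive activation addition, ridge regression, intervention during autoregressive generation, cosine similarity analysis

Strengths

In-depth mechanistic interpretability analysis that traces the root of the calibration problem from a geometric perspective
The method needs no retraining; it works through inference-time intervention, which makes it practical
Experiments cover three open-weight models and four datasets, which supports the generality of the conclusions

Limitations

Focuses mainly on linear directions and may miss more complex nonlinear relationships
Steering-vector construction depends on a specific prompt framing and may be sensitive to changes in prompt wording
Although calibration improves, steering may affect the quality or fluency of the model's generations

Relevance to research directions:

This paper falls under innovation in the technical principles of large models. It studies the internal mechanisms of deep learning models (specifically Transformer-based LLMs) and uses mechanistic interpretability techniques (linear probes, activation steering) to address a key reliability problem in real-world LLM use (confidence calibration). This ties directly to “innovation in the technical principles of large models and deep learning” and matters for the trustworthiness and safety of large models in high-stakes domains such as science and medicine.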

2. AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer’s Disease Diagno

Authors: Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25322v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

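Scoring rationale: The core of AD-CARE is an LLM-based agent that supports clinical diagnosis of Alzheimer's disease. It is therefore highly relevant to “Large Language Models” and “LLM Agents” (10 points each), since the paper explicitly uses LLMs and builds an LLM-driven agent. It is highly relevant to “Tool Use” (10 points) because the agent dynamically orchestrates specialized diagnostic tools, and to “AI for Science” (10 points) because it applies large models to biomedicine (Alzheimer's disease diagnosis). The paper does not cover the technical specifics behind the other keywords (such as MoE, quantization, inference acceleration, or alignment methods), so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes AD-CARE, an LLM-based agent framework that generates clinical diagnostic reports for Alzheimer's disease from heterogeneous multimodal data. It reaches 84.9% diagnostic accuracy across six cohorts, reduces performance disparities across subgroups, and in a reader study improves clinicians' diagnostic accuracy while cutting decision time.

Abstract translation (Chinese)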

阿尔茨海默病（Alzheimer’s disease, AD）随着人口老龄化已成为日益严峻的全球健康挑战，及时、准确的诊断对于减轻个体和社会负担至关重要。然而，现实世界中的AD评估受到不完整、异质的多模态数据以及不同中心与患者人口统计学差异的阻碍。尽管大语言模型（large language models, LLMs）在生物医学领域展现出潜力，但其在AD中的应用大多局限于回答狭窄的疾病特定问题，而非生成支持临床决策的综合性诊断报告。本研究通过引入AD-CARE扩展了LLMs在临床决策支持方面的能力：这是一种模态无关的智能体，能够基于不完整且异质的输入数据，在不补全缺失模态的情况下，执行基于临床指南的诊断评估。通过动态协调专用诊断工具并将临床指南嵌入LLM驱动的推理过程，AD-CARE生成透明、报告式的输出，其形式与现实临床工作流程相符。在包含10,303个病例的六个队列中，AD-CARE实现了84.9%的诊断准确率，相对于基线方法获得了4.2%-13.7%的相对提升。尽管队列间存在差异，其数据集特异性准确率保持稳健（80.4%-98.8%），且该智能体在所有队列中均优于所有基线方法。AD-CARE减少了不同种族和年龄亚组间的性能差异，将四项指标的平均离散度分别降低了21%-68%和28%-51%。在一项受控阅片者研究中，该智能体将神经科医生和放射科医生的诊断准确率提高了6%-11%，并将决策时间缩短了一半以上。该框架在八种骨干LLMs上实现了2.29%-10.66%的绝对性能提升，并使它们的表现趋于一致。这些结果表明，AD-CARE是一个可扩展、具备实际部署能力的框架，可整合到常规临床工作流程中，为AD提供多模态决策支持。

Abstract

Alzheimer’s disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

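Keywords: Alzheimer’s disease diagnosis, Large language models, LLM agent, Clinical decision support, Multimodal data, Guideline-grounded, Fairness analysis, Reader study

Deep analysis:

AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

Summary:

To address real-world challenges in Alzheimer's disease (AD) diagnosis, namely incomplete and heterogeneous multimodal data and large cross-site variability, the paper proposes AD-CARE, a guideline-grounded, modality-agnostic LLM agent. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM reasoning, the system produces transparent diagnostic reports from incomplete, heterogeneous inputs without imputing missing modalities. Across six cohorts comprising 10,303 cases, AD-CARE reaches 84.9% diagnostic accuracy, a 4.2%-13.7% relative improvement over baseline methods. It reduces performance disparities across racial and age subgroups, improves the accuracy and speed of neurologists and radiologists in a reader study, and works consistently across different backbone LLMs.

Contributions:

Proposes what the analysis describes as the first modality-agnostic, zero-imputation agent for AD diagnosis, which handles incomplete, heterogeneous multimodal clinical data directly without imputing missing modalities.
Designs a guideline-grounded LLM agent architecture in which a planner, a library of specialized tools, and a coordinator work together to mimic a specialist's diagnostic reasoning and produce transparent reports that conform to clinical norms.
Carries out large-scale, multi-dimensional clinical validation (multi-cohort evaluation, fairness analysis, and a reader study), showing advantages in cross-site generalization, reduced racial and age bias, and support for clinicians' decisions.
Demonstrates backbone-agnostic generality: the framework improves and converges performance across LLMs of different sizes and architectures, which eases deployment in different resource settings.

Method

!!! info

The paper uses an LLM-based agent framework with three core components: 1) **Planner**: interprets the diagnostic request, decomposes the task, and identifies the available data modalities; 2) **Tool library**: integrates specialized analysis tools for specific modalities (such as hippocampal atrophy quantification, brain volume measurement, and polygenic risk score calculation); 3) **Coordinator**: integrates the tools' outputs according to clinical guidelines and exemplar cases to produce a structured diagnostic report. The study tests generalization with a multi-cohort evaluation (6 cohorts, 10,303 cases), assesses fairness with subgroup analysis, and verifies clinical utility with a controlled reader study.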
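
The planner / tool library / coordinator split can be sketched as a dispatch over whichever modalities a case actually contains. This is a structural illustration only, not the AD-CARE implementation: in the paper the planner and coordinator are LLM-driven and guideline-grounded, whereas here they are plain functions, and every tool name, field name, and value is hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical tool library: one analysis function per modality.
TOOLS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "mri": lambda data: f"hippocampal volume z-score {data['hippocampus_z']:+.1f}",
    "genetics": lambda data: f"polygenic risk score {data['prs']:.2f}",
    "cognition": lambda data: f"cognitive test score {data['score']}",
}

Case = Dict[str, Optional[Dict[str, Any]]]


def plan(case: Case) -> List[str]:
    """Planner stand-in: select only the tools whose modality is present."""
    return [m for m in TOOLS if case.get(m) is not None]


def run_tools(case: Case, modalities: List[str]) -> Dict[str, str]:
    """Run each selected tool on its own modality; nothing is imputed."""
    return {m: TOOLS[m](case[m]) for m in modalities}


def coordinate(findings: Dict[str, str]) -> str:
    """Coordinator stand-in: assemble a report and list missing modalities explicitly."""
    lines = [f"- {m}: {text}" for m, text in findings.items()]
    missing = [m for m in TOOLS if m not in findings]
    if missing:
        lines.append("- not available (not imputed): " + ", ".join(missing))
    return "\n".join(lines)


# A case with no genetic data: the genetics tool is skipped, not filled in.
case: Case = {"mri": {"hippocampus_z": -1.8}, "genetics": None, "cognition": {"score": 23}}
report = coordinate(run_tools(case, plan(case)))
print(report)
```

The design point is that a missing modality changes which tools run rather than triggering imputation, and the report states the gap explicitly so a reader can see what evidence was and was not used.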

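Key results:

Reaches an average diagnostic accuracy of 84.9% across the 6 cohorts (95% CI: 84.2%–85.6%), a 4.2%–13.7% relative improvement over baseline methods.
Dataset-specific accuracy stays between 80.4% and 98.8%, indicating robustness across cohorts.
Reduces performance disparities across subgroups: the average dispersion of four metrics (including accuracy and F1 score) falls by 21%–68% across racial subgroups and by 28%–51% across age subgroups.
In the reader study, raises neurologists' and radiologists' accuracy by 6%–11% and cuts decision time by more than half.
Improves performance on all 8 backbone LLMs (absolute gains of 2.29%–10.66%) and brings the different models' performance closer together.

Tech stack: large language models, agent framework, multimodal learning, hippocampal and global atrophy quantification algorithms, white/gray matter volume measurement, polygenic risk score calculation, zero-shot/few-shot inference

Strengths

Targets a real clinical pain point: it is designed for the incomplete, heterogeneous data that is normal in practice and avoids the bias that imputation can introduce.
High interpretability: it produces guideline-grounded, structured diagnostic reports with a transparent reasoning process that clinicians can follow and trust.
Rigorous validation: large-scale multi-cohort evaluation, fairness analysis, and a reader study together form a complete chain of evidence.
Flexible deployment: because the framework is backbone-agnostic, the underlying LLM can be chosen to fit the available compute.

Limitations

Performance depends on the accuracy of the specialized analysis tools in the tool library (such as image segmentation and genetic scoring); bias in an underlying tool may propagate to the result.
Although it is compatible with low-cost models, running the full LLM-driven agent still requires some compute, which may limit use in severely resource-constrained settings.
The system is built on current clinical guidelines; if diagnostic criteria change substantially, its logic may need to be revised.
The hallucination risk inherent to LLMs is constrained by tools and guidelines, but caution is still needed for very rare cases.

Relevance to research directions:

The paper is highly relevant to the keywords. It belongs to the sub-area “applications of large models and deep learning in science”, with a specific focus on biomedical AI (Alzheimer's disease diagnosis). It applies LLM agent technology to a complex clinical multimodal diagnosis problem and also contributes to agent architecture design (modality-agnostic, guideline-grounded reasoning), giving it both technical novelty and practical value.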

3. FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Prot

Authors: Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, Chi Zhang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24943v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

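Scoring rationale: The paper's core is evaluating how well LLM agents solve real problems in finance through tool calls, so it is highly relevant to “LLM Agents” and “Tool Use” (10 points each). It explicitly concerns LLMs, so “Large Language Models” scores 10. It evaluates reasoning ability, which gives some connection to “Chain of Thought” and “System 2 Thinking” (5 points each). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes FinMCP-Bench, a benchmark for evaluating how well LLM agents solve real-world financial problems by calling financial tools, and systematically evaluates mainstream LLMs on it.

Abstract translation (Chinese)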

本文介绍了FinMCP-Bench，这是一个通过金融模型上下文协议工具调用来评估大语言模型解决现实世界金融问题能力的新型基准。FinMCP-Bench包含613个样本，涵盖10个主要场景和33个子场景，融合了真实与合成的用户查询，以确保多样性和真实性。它整合了65个真实的金融MCP以及三种样本类型——单工具、多工具和多轮对话，从而能够评估模型在不同任务复杂度下的表现。利用此基准，我们系统评估了一系列主流大语言模型，并提出了明确衡量工具调用准确性和推理能力的指标。FinMCP-Bench为推进金融大语言模型智能体的研究提供了一个标准化、实用且具有挑战性的测试平台。

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples, single tool, multi-tool, and multi-turn, allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

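Keywords: LLM Agents, Tool Use, Financial Benchmark, Model Context Protocol, Reasoning Evaluation, Real-world Financial Problems, Tool Invocation Accuracy

Deep analysis:

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Summary:

This paper introduces FinMCP-Bench, a benchmark for evaluating how well large language models (LLMs) call tools through the Model Context Protocol (MCP) in real financial scenarios. The benchmark contains 613 samples covering 10 main scenarios and 33 sub-scenarios, and integrates 65 real financial MCP tools. The dataset includes three sample types (single-tool, multi-tool, and multi-turn) and is built by extracting real data from production logs and synthesizing harder samples with chain-based and role-playing-based methods. The study systematically evaluates mainstream LLMs and proposes metrics for tool invocation accuracy and reasoning ability, providing a standardized and challenging testbed for research on financial LLM agents.

Contributions:

Proposes FinMCP-Bench, described as the first benchmark for MCP tool use in finance, covering diverse real and synthetic financial scenarios.
Designs a chain-based synthesis method for multi-tool samples that raises task complexity by building a tool dependency graph and generating trajectories from it.
Designs a role-playing-based synthesis method for multi-turn samples that simulates realistic user personas and goals to produce high-quality multi-turn dialogues.
Establishes a strict quality-control process that combines automatic validation with expert review to ensure data quality and authenticity.

Method

!!! info

The study first collects real interaction data from the production logs of the “小顾” AI assistant. It then applies two synthesis strategies to increase complexity: a chain-based method, which builds a tool dependency graph and samples paths from it to generate multi-tool trajectories; and a role-playing-based method, which defines user personas and goals and uses an LLM to simulate dialogues that become multi-turn samples. Finally, quality is controlled through automatic validation and independent review by 6 financial experts (5-point Likert scale).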
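
The chain-based strategy described above amounts to walking a tool dependency graph. The following minimal sketch shows one way to sample such a path; the graph, the tool names, and the sampling rule are all hypothetical and are not taken from FinMCP-Bench.

```python
import random
from typing import Dict, List

# Hypothetical tool dependency graph: an edge A -> B means B can consume A's output.
TOOL_GRAPH: Dict[str, List[str]] = {
    "search_fund": ["get_fund_nav", "get_fund_holdings"],
    "get_fund_nav": ["compute_return"],
    "get_fund_holdings": ["get_stock_quote"],
    "get_stock_quote": [],
    "compute_return": [],
}


def sample_chain(graph: Dict[str, List[str]], start: str, max_len: int,
                 rng: random.Random) -> List[str]:
    """Random walk along dependency edges until max_len or a tool with no successors."""
    chain = [start]
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


def is_valid_chain(graph: Dict[str, List[str]], chain: List[str]) -> bool:
    """Every consecutive pair in the chain must be an edge of the graph."""
    return all(b in graph[a] for a, b in zip(chain, chain[1:]))


rng = random.Random(0)
chain = sample_chain(TOOL_GRAPH, "search_fund", max_len=3, rng=rng)
print(" -> ".join(chain))
```

Each sampled chain is a tool sequence that is executable by construction; a user query that requires exactly that sequence can then be written for it, which is how a multi-tool sample would be assembled under this reading.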

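Key results:

Builds the FinMCP-Bench dataset of 613 high-quality samples, covering 10 main scenarios (such as market analysis and investment planning) and 33 sub-scenarios.
Integrates 65 real financial MCP tools; the dataset includes single-tool, multi-tool, and multi-turn samples, and on average each sample involves multiple tool calls.
The evaluation of mainstream LLMs shows where current models do well and where they struggle with complex multi-tool dependencies and multi-turn dialogue.

Tech stack: Model Context Protocol (MCP), LLM-as-a-Judge (Qwen3-235B-2507), Tool Dependency Graph, Role-playing Simulation, 5-point Likert Scale

Strengths

Highly realistic data, based on production logs and real financial tools.
Broad scenario coverage, spanning many financial tasks and interaction patterns of varying complexity.
A careful construction method that combines real-data extraction with LLM-assisted synthesis, followed by strict expert review.
Built on the standardized MCP protocol, which helps generality and extensibility.

Limitations

The dataset is relatively small (613 samples) and may not cover all long-tail financial scenarios.
The synthetic data has been validated but may still deviate from fully real user behavior.
The focus is on the accuracy of the tool-calling process; the evaluation of whether the final financial advice is professionally compliant is described in little detail.

Relevance to research directions:

The paper is an application study of large models in the financial vertical and closely matches “research applications of large models in different domains”. It proposes new benchmark construction methods (chain-based and role-playing-based synthesis), which touch on the technical principles of LLM agents and so fit “innovation in the technical principles of large models and deep learning”. Its focus on the tool-calling ability of LLM agents is a current hot topic in agent research and has both novelty and practical value.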

4. SEVerA: Verified Synthesis of Self-Evolving Agents

Authors: Debangshu Banerjee, Changming Xu, Gagandeep Singh Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25111v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

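Scoring rationale: SEVerA provides a formal verification framework for self-evolving LLM agents that ensures their safety and correctness. Its core content directly involves LLM agents (“LLM Agents”), self-evolution and self-improvement (“Self-Correction”), and tool use (“Tool Use”); these are the paper's central contributions, so each scores 10. The paper explicitly uses LLMs as planners and generative models, so “Large Language Models” is highly relevant and scores 10. Other keywords, such as MoE, quantization, inference acceleration, and hallucination mitigation, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SEVerA framework, which uses formal verification to ensure the safety and correctness of self-evolving LLM agents, achieving zero constraint violations and improved performance on program verification, math synthesis, and tool-use tasks.

Abstract translation (Chinese)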

近期研究表明，自进化大语言模型智能体在程序修复和科学发现等任务中展现出显著成效。该范式通过规划大语言模型合成调用参数化模型（包括大语言模型）的智能体程序，并针对具体任务进行调优以提升性能。然而，现有自进化智能体框架缺乏对安全性或正确性的形式化保证。由于此类程序常需在未见输入上自主执行，这种保证缺失引发了可靠性与安全性的担忧。本文将智能体代码生成建模为约束学习问题，将硬性形式化规约与捕捉任务效用的软性目标相结合。我们提出形式化守护生成模型，该模型允许规划大语言模型使用一阶逻辑为每个生成模型调用指定形式化输出契约。每个形式化守护生成模型调用将底层模型封装于具备可验证回退机制的拒绝采样器中，确保在任何输入和参数设置下返回的输出均满足契约要求。基于形式化守护生成模型，我们构建了自进化验证智能体框架，其包含三个阶段：搜索阶段合成包含形式化守护生成模型调用的候选参数化程序；验证阶段针对所有参数值证明程序满足硬性约束的正确性，将问题简化为无约束学习；学习阶段采用可扩展的基于梯度的优化方法（包括类GRPO微调）来提升软性目标，同时保持正确性。我们在Dafny程序验证、符号数学合成及策略合规智能体工具使用三个基准任务上评估自进化验证智能体框架。实验表明，该框架在所有任务中均实现零约束违反，并在性能上超越无约束方法与当前最优基线，证明形式化行为约束不仅能保证正确性，还能引导合成过程产生更高质量的智能体。

Abstract

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($τ^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.

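Keywords: Self-Evolving Agents, LLM Agents, Formal Verification, Safety Guarantees, Agentic Code Generation, Tool Use, Constrained Learning, Verified Synthesis

Deep analysis:

SEVerA: Verified Synthesis of Self-Evolving Agents

Summary:

Self-evolving LLM agents used for tasks such as program repair lack formal safety guarantees; this paper proposes the SEVerA framework to address that. The framework treats agentic code generation as a constrained learning problem that combines hard formal specifications with soft task objectives. The authors introduce Formally Guarded Generative Models (FGGM), which use first-order logic to specify an output contract for each model call and use rejection sampling with a verified fallback to ensure the output satisfies it. SEVerA has three stages, search, verification, and learning: a planner LLM synthesizes candidate programs, a verifier proves their correctness, and gradient-based optimization then improves performance. Experiments show that SEVerA achieves zero constraint violations across several benchmarks while outperforming state-of-the-art baselines on task performance, indicating that formal constraints can steer synthesis toward higher-quality agents.

Contributions:

Proposes Formally Guarded Generative Models (FGGM), which define and enforce a local contract on each model call in first-order logic and apply to both open and closed models.
Designs SEVerA, described as the first synthesis algorithm for self-evolving agents with verifiable guarantees, which solves the constrained learning problem in three stages: search, verification, and learning.
Proves the soundness of SEVerA: the returned agent satisfies the behavioral specification for all inputs and parameter values.
Shows that formal behavioral constraints not only guarantee safety but also prune the space of candidate programs, steering synthesis toward higher-performing agents.

Method

!!! info

The paper models agentic code generation as a constrained learning problem. First, it introduces FGGM to wrap generative models: input-output contracts are defined in first-order logic, and rejection sampling with a verified fallback ensures compliant outputs. It then runs a three-stage pipeline: 1) Search: the planner LLM synthesizes candidate parametric programs containing FGGM calls (written in Dafny); 2) Verification: Dafny's built-in verifier proves that the program satisfies the hard constraints for all parameter values; 3) Learning: the verified program is treated as an unconstrained optimization problem, and gradient-based optimization (such as GRPO-style fine-tuning) maximizes the soft objective.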
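
The FGGM idea of wrapping a model call in a rejection sampler with a fallback can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: contracts there are first-order logic formulas and the fallback is proved correct statically in Dafny, whereas here the contract is an ordinary Python predicate, the fallback is checked with a runtime assertion, and the "model" is a stub that replays canned outputs.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def guarded_call(generate: Callable[[], T], contract: Callable[[T], bool],
                 fallback: Callable[[], T], max_tries: int = 5) -> T:
    """Return the first sampled output that satisfies the contract, else the fallback."""
    for _ in range(max_tries):
        candidate = generate()
        if contract(candidate):
            return candidate
    result = fallback()
    # Runtime stand-in for the static proof that the fallback meets the contract.
    assert contract(result), "fallback must satisfy the contract"
    return result


def in_percent_range(x: int) -> bool:
    """Example contract: the output must be an integer percentage."""
    return 0 <= x <= 100


def make_model(outputs: Iterable[int]) -> Callable[[], int]:
    """Stub generative model that replays a fixed list of outputs."""
    it = iter(outputs)
    return lambda: next(it)


accepted = guarded_call(make_model([150, -3, 42]), in_percent_range, fallback=lambda: 0)
fell_back = guarded_call(make_model([150, -3, 999]), in_percent_range,
                         fallback=lambda: 0, max_tries=3)
print(accepted, fell_back)
```

Whatever the underlying model produces, the caller only ever sees a value that satisfies the contract, which is what lets the surrounding program be verified independently of the model's parameters. The retries are also the source of the extra inference latency noted under the limitations.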

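Key results:

Achieves zero constraint violations on all evaluated tasks, guaranteeing formal safety.
Reaches a 97.0% verification rate on HumanEvalDafny, versus 86.9% for the best baseline.
Reaches 66.0% accuracy on GSM-Symbolic, versus 44.7% for the best constrained decoding method.
On the τ²-bench airline domain, reaches a 52.6% pass rate with Qwen3-8B, exceeding Agent-C running on Claude Sonnet 4.5.

Tech stack: large language models, formal verification, the Dafny programming language and verifier, SMT solvers, rejection sampling, GRPO (Group Relative Policy Optimization), first-order logic, gradient-descent optimization

Strengths

Provides strict formal guarantees of safety and correctness, addressing the reliability risks of self-evolving agents.
The method is general: it does not depend on a specific model architecture and applies to both open and closed models.
Improves task performance while guaranteeing safety, showing that constraints help guide the search.
Rests on a solid theoretical foundation, with full proofs of its theorems.

Limitations

Constraints must be expressed in first-order logic, which may be a bottleneck for natural-language constraints that are very complex or hard to formalize.
The verification stage may require substantial compute, especially for programs with complex logic.
Rejection sampling may increase inference latency, because outputs that violate the constraints are rejected and resampled.

Relevance to research directions:

The paper is an innovation in the technical principles of large models, focused on safety verification and synthesis of agents, and so matches the user's interest in that area directly. Although its main application scenarios are program synthesis and symbolic math, the verification framework is general and could extend to other areas such as scientific computing. Its combination of formal methods with deep learning is highly novel and meets the bar for a high score.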

5. Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Reward

Authors: Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24709v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

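Scoring rationale: The paper's core topic is training LLMs for multi-step tool orchestration. It is highly relevant to “Large Language Models”, “Tool Use”, and “LLM Agents” (10 points each), because it directly studies how LLMs call multiple APIs and manage the dependencies between them. It has some connection to “Chain of Thought” (8 points), because multi-step orchestration involves sequential reasoning, although the paper does not use CoT terminology explicitly. Other keywords such as MoE, SFT, and RAG are not mentioned in or related to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses LLM failures in multi-step tool orchestration caused by parameter value errors and sequential dependencies. By building a reinforcement learning environment backed by real API responses and designing graduated rewards, it substantially improves turn accuracy on ComplexFuncBench.

Abstract translation (Chinese)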

多步骤工具编排任务要求大语言模型按正确顺序调用多个相互依赖的API并传递中间输出，目前仍具挑战性。现有先进模型在执行完整序列时频繁出错，其中参数值错误占失败案例的显著比例。训练模型处理此类工作流面临两大障碍：现有环境主要关注基于模拟数据的单轮次简单函数调用，且二元奖励机制无法为部分正确的执行提供有效信号。
我们提出了一个同时应对这两项挑战的框架。首先，我们构建了一个由大规模真实API响应缓存支持的强化学习环境，该环境支持数据合成流程，能够以可控复杂度采样有效的多步骤编排轨迹，其生成效率显著高于无约束方法。其次，我们提出了一种渐进式奖励设计，将正确性分解为原子有效性（在递增的粒度层级上评估单个函数调用的正确性）和编排能力（遵循依赖关系的正确工具排序）。在ComplexFuncBench基准测试中，我们的方法在轮次准确率上展现出显著提升。消融实验证实两项奖励组件均不可或缺：单独使用任一组件都会导致性能大幅下降。

Abstract

Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.

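Keywords: Large Language Models, Multi-step Tool Orchestration, API Tool Use, Reinforcement Learning, Graduated Rewards, Data Synthesis, ComplexFuncBench, Turn Accuracy

Deep analysis:

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Summary:

Multi-step tool orchestration poses three problems for large models: a high rate of parameter value errors, training environments that lack real dependency chains, and sparse reward signals. This paper proposes a training framework that addresses them. First, it builds a deterministic reinforcement learning environment backed by a large-scale cache of real API responses, together with a constrained data synthesis pipeline that uses workflow templates to efficiently generate valid multi-step trajectories of controllable complexity. Second, it proposes a graduated reward design that decomposes correctness into atomic validity (whether each individual function call is correct) and orchestration validity (whether tools are sequenced correctly and dependencies are respected), which yields a dense learning signal. Experiments on ComplexFuncBench show a substantial improvement in turn accuracy, and ablations confirm that both reward components are necessary.

Contributions:

Builds a deterministic RL training environment backed by a large-scale cache of real API responses (more than 100,000 responses), avoiding the inconsistencies of simulated data.
Proposes a constrained data synthesis pipeline that uses workflow templates and a reverse index to sample valid multi-step trajectories efficiently, substantially improving data generation efficiency.
Designs a graduated reward that splits the reward into atomic correctness and orchestration correctness, giving dense feedback on multi-step tasks and addressing the sparse-reward problem.

Method

!!! info

The paper uses reinforcement learning. First, it builds a cached environment by defining workflow templates and collecting real API responses from Booking.com. Second, it performs constraint-aware sampling from the cache through a reverse index and uses an LLM to generate the matching natural-language queries, producing synthetic training data. Finally, during training it applies GRPO (Group Relative Policy Optimization) with the graduated rewards (R_atomic and R_orch) to update the model so that it learns the correct order of function calls and how to pass parameters between them.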
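
The report names the two reward terms (R_atomic and R_orch) but not their formulas. The sketch below shows one plausible shape for such a graduated reward. The grading levels (0.4, 0.7, 1.0), the equal weighting, the positional alignment of predicted and gold calls, and the tool names are all assumptions made for illustration, not the paper's definitions.

```python
from typing import Any, Dict, List, Tuple

Call = Dict[str, Any]  # {"name": str, "args": dict}


def atomic_reward(pred: Call, gold: Call) -> float:
    """Grade one call at increasing granularity: name, then argument keys, then values."""
    if pred["name"] != gold["name"]:
        return 0.0
    if set(pred["args"]) != set(gold["args"]):
        return 0.4  # right function, wrong parameter set
    if pred["args"] != gold["args"]:
        return 0.7  # right parameters, at least one wrong value
    return 1.0


def orchestration_reward(pred_names: List[str],
                         dependencies: List[Tuple[str, str]]) -> float:
    """Fraction of (before, after) dependencies respected by the predicted call order."""
    if not dependencies:
        return 1.0
    first_pos: Dict[str, int] = {}
    for i, name in enumerate(pred_names):
        first_pos.setdefault(name, i)
    respected = sum(
        1 for before, after in dependencies
        if before in first_pos and after in first_pos
        and first_pos[before] < first_pos[after]
    )
    return respected / len(dependencies)


def graduated_reward(pred_calls: List[Call], gold_calls: List[Call],
                     dependencies: List[Tuple[str, str]],
                     w_atomic: float = 0.5, w_orch: float = 0.5) -> float:
    """Weighted sum of the mean atomic score and the orchestration score."""
    if not gold_calls:
        return 0.0
    atomic = sum(atomic_reward(p, g) for p, g in zip(pred_calls, gold_calls)) / len(gold_calls)
    orch = orchestration_reward([c["name"] for c in pred_calls], dependencies)
    return w_atomic * atomic + w_orch * orch


gold = [{"name": "search_hotel", "args": {"city": "Paris"}},
        {"name": "book_hotel", "args": {"hotel_id": "h1"}}]
deps = [("search_hotel", "book_hotel")]
wrong_value = [gold[0], {"name": "book_hotel", "args": {"hotel_id": "h2"}}]

perfect = graduated_reward(gold, gold, deps)
partial = graduated_reward(wrong_value, gold, deps)
print(perfect, partial)
```

A binary reward would score the second trajectory as zero even though the call order and one of the two calls are correct; the graduated version still gives it partial credit, which is the dense signal the paper argues multi-step training needs.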

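Key results:

On ComplexFuncBench, the method substantially improves the model's turn accuracy.
Ablations show that using only the atomic reward or only the orchestration reward degrades performance significantly, confirming that both are needed together.
Constrained data synthesis achieves a higher generation success rate and efficiency than unconstrained synthesis.

Tech stack: reinforcement learning (RL), Group Relative Policy Optimization (GRPO), workflow templates, reverse index, large language models (LLMs) for query generation, caching

Strengths

Addresses the lack of real data dependencies in multi-step tool orchestration by using a cache of real API responses.
Mitigates the sparse-reward problem on long-sequence tasks by giving fine-grained feedback through graduated rewards.
Synthesizes data efficiently: constrained sampling avoids generating invalid trajectories.

Limitations

Data synthesis depends on predefined workflow templates and may not cover every complex orchestration pattern.
The cached environment is based on real data but is static, so it may not fully reflect changes in live APIs.
Validation is mainly in specific domains (such as hotel booking and car rental); how well the approach generalizes to other domains needs further study.

Relevance to research directions:

The paper focuses on innovation in the technical principles of large models, specifically on improving their ability to use external tools. It applies reinforcement learning to LLM training, which places it at the intersection of deep learning and large-model techniques. Although the application scenario is API calling, its core contribution is improving how LLMs handle complex logic and sequential decisions, which is highly relevant to “innovation in the technical principles of large models and deep learning”.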

6. An Experimental Comparison of the Most Popular Approaches to Fake News Detection

Authors: Pietro Dell’Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro Venue/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25501v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

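Scoring rationale: The paper's core topic is fake news detection, which directly involves LLMs (as tools for both generating and detecting fake news), so “Large Language Models” scores 10. “Hallucination Mitigation” scores 10 because fake news detection is essentially about identifying and mitigating false or inaccurate information that models generate or spread. “Pre-training” and “Post-training” each score 5 because the paper discusses fine-tuned models and possible pre-training exposure. “In-context Learning” scores 5 because the paper has LLMs detect fake news through zero-shot and few-shot learning. The remaining keywords (such as MoE, quantization, and RAG) are not covered and score 0.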
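
As a quick check of the arithmetic, the sketch below recomputes this paper's total and the passing threshold from the formula in the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names are illustrative, and treating each row's score as weight × relevance is an assumption that matches the table, where every weight is 1.0.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of weight * relevance over the keywords with non-zero relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows for this paper: LLMs, Pre-training, Post-training,
# Hallucination Mitigation, In-context Learning.
rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 10.0), (1.0, 5.0)]

if __name__ == "__main__":
    print(paper_score(rows), round(passing_score(27.0), 1))
```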

!!! tip deepseek-chat TL;DR

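The paper systematically evaluates 12 fake news detection approaches on 10 datasets from different domains. It finds that fine-tuned models perform well in-domain but generalize poorly, while LLMs offer a promising alternative through zero-shot and few-shot learning.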

Abstract

In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into “Real” and “Fake” to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.

Keywords: fake news detection, large language models, domain adaptation, zero-shot learning, few-shot learning, binary classification, cross-domain evaluation, robustness evaluation

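In-depth analysis:

An Experimental Comparison of the Most Popular Approaches to Fake News Detection

Summary:

The paper presents a large-scale empirical benchmark that targets the poor cross-domain generalization of fake news detection models. It evaluates 12 representative approaches, spanning traditional machine learning, deep learning, Transformers, and specialized cross-domain architectures, on 10 heterogeneous datasets. Four experimental settings (dataset-specific, cross-dataset, mixed training, and leave-one-dataset-out) simulate realistic domain shift. The results show that fine-tuned models perform very well in-domain but generalize poorly; specialized cross-domain architectures improve generalization but are highly data-dependent; and LLMs, through zero-shot and few-shot learning, strike a good balance between generalization and data requirements. The study is a useful reference for building robust detection systems.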

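Innovations:

Designs a rigorous experimental protocol with four settings (in-domain, cross-domain, mixed training, and leave-one-out) to systematically evaluate robustness under realistic domain shift.
Compares 12 approaches across different technical paradigms (traditional ML, deep learning, Transformers, cross-domain architectures, and LLMs), providing a comprehensive benchmark.
Exposes the generalization failure of fine-tuned models in cross-domain settings and validates the potential of LLMs as general-purpose detectors that do not need large amounts of labeled data.
Unifies 10 heterogeneous datasets into a binary classification task and explicitly defines a “domain” as dataset identity, capturing the joint effect of distribution shifts in topic, source, and labeling semantics.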

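Method

!!! info

The paper is an empirical comparison. First, it selects 10 public English text-only fake news datasets and harmonizes their labels into “Real” and “Fake”. Second, it selects 12 representative models, including traditional machine learning (e.g., SVM), deep learning (e.g., BiLSTM), Transformers (e.g., BERT), specialized cross-domain architectures (e.g., MDFEND, MERMAID), and LLMs (e.g., GPT-3.5). Finally, it runs four groups of experiments: dataset-specific tests (the baseline), cross-dataset tests (which expose domain shift), mixed-training tests (multi-domain simulation), and leave-one-dataset-out tests (cross-domain generalization), evaluated with metrics such as accuracy.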
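
The four settings differ only in which datasets a model is trained on and which one it is tested on, so they can be enumerated with a small sketch. The dataset names are placeholders, and the assumption that every dataset has its own train/test split (so that in-domain and mixed training evaluate on held-out examples) is mine rather than something the report states.

```python
from itertools import permutations


def dataset_specific(names):
    """Train and test on the same dataset (each dataset's own train/test split assumed)."""
    return [([n], n) for n in names]


def cross_dataset(names):
    """Train on one dataset, test on a different one."""
    return [([train], test) for train, test in permutations(names, 2)]


def mixed_training(names):
    """Train on the union of all datasets, test on each dataset's held-out split."""
    return [(list(names), n) for n in names]


def leave_one_dataset_out(names):
    """Train on all datasets except one, test on the held-out dataset."""
    return [([n for n in names if n != held_out], held_out) for held_out in names]


if __name__ == "__main__":
    datasets = ["A", "B", "C"]  # placeholder names, not the paper's datasets
    for train, test in leave_one_dataset_out(datasets):
        print("train on", train, "-> test on", test)
```

With 10 datasets this gives 10 dataset-specific runs, 90 cross-dataset pairs, and 10 runs each for mixed training and leave-one-dataset-out, per model.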

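Key results:

Most standard methods (such as fine-tuned BERT) perform very well in-domain but drop significantly in cross-domain tests, indicating severe overfitting.
Purpose-built cross-domain architectures (such as MoE models) do relatively well in the leave-one-dataset-out experiments but need large amounts of training data and perform poorly when data is limited.
In zero-shot and few-shot settings, LLMs may have lower absolute accuracy than fine-tuned specialized models, but they show stronger cross-domain generalization and robustness.
Dataset-specific artifacts (such as writing style and labeling scheme) are the main reason models fail to generalize.

Tech stack: SVM, BiLSTM, BERT, RoBERTa, REAL-FND, DAFNE, FADED, MDFEND, M3FEND, MERMAID, GPT-3.5/4, supervised learning, transfer learning, adversarial training, mixture of experts, zero-shot/few-shot learning, prompt engineering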

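Strengths

Comprehensive evaluation: covers a broad spectrum of techniques, from traditional ML to recent LLMs, tested on multiple heterogeneous datasets.
Focus on generalization: goes beyond in-domain performance and targets cross-domain generalization, which is critical in real deployments.
Rigorous experimental design: four progressively harder settings stress-test model robustness, which makes the conclusions more convincing.
Practical significance: clearly identifies the limits of existing models in realistic scenarios and validates the potential of LLMs to reduce dependence on labeled data.

Limitations

Single modality: covers only English text and ignores multimodal information such as images, video, and social propagation context (e.g., user profiles, repost structure).
Confounded data: cannot fully disentangle how topic, time, source, and labeling semantics each affect model performance.
LLM pre-training contamination: does not control for LLMs having seen test data during pre-training, which may distort the assessment of their true generalization ability.
No evaluation on generated fake news: the experimental data consists mainly of human-written fake news, with no dedicated test on LLM-generated fake news.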

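Relevance to research direction:

The paper is highly relevant. It directly concerns applying LLMs to fake news detection, a concrete scientific and social problem. It evaluates the zero-shot and few-shot abilities of LLMs and compares them with deep learning models (such as BERT and BiLSTM) and specialized architectures. It also examines model generalization as a question of technical principle, analyzing the strengths and weaknesses of different technical approaches under domain shift. This fits both “innovation in the technical principles of large models and deep learning” and “research applications of large models in different domains”.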

7. TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang Venue/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25419v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

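Scoring rationale: The paper's core topic is the use of LLMs for mathematical reasoning, specifically closing the performance gap in multilingual settings with the reinforcement learning framework TAPO, so it is highly relevant to “Large Language Models” (10). It builds its reinforcement learning framework on GRPO, which relates to “RLHF” and similar keywords (8). It involves an understand-then-reason paradigm, which relates to “Chain of Thought” and “System 2 Thinking” (8 each). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Quantization are not mentioned in the abstract and score 0.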

!!! tip deepseek-chat TL;DR

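The paper addresses the performance gap of LLMs in multilingual mathematical reasoning, which stems from insufficient language understanding. It proposes TAPO, a reinforcement-learning-based translation-augmented policy optimization framework that improves performance on multilingual mathematical reasoning and translation tasks and generalizes to unseen languages and out-of-domain tasks.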

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.

Keywords: Large Language Models, multilingual mathematical reasoning, reinforcement learning, policy optimization, translation-augmented, understand-then-reason, GRPO, step-level relative advantage

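In-depth analysis:

TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Summary:

The paper addresses the performance gap of LLMs in multilingual mathematical reasoning and attributes it mainly to insufficient language understanding. The authors propose TAPO (Translation Augmented Policy Optimization), a new reinforcement learning framework built on GRPO. TAPO forces the model to follow an explicit understand-then-reason paradigm: it first translates the problem into English as a proxy for understanding, then reasons. To resolve the conflict between translation and reasoning rewards, the paper introduces a step-level relative advantage mechanism that decouples credit assignment for understanding from credit assignment for reasoning. Experiments show that TAPO combines language understanding and reasoning effectively, outperforms baselines on multilingual mathematical reasoning and translation tasks, and generalizes to unseen languages and out-of-domain tasks.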

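Innovations:

Proposes the TAPO framework, which uses reinforcement learning to enforce an explicit understand-then-reason paradigm and uses English as a pivot language to close the multilingual understanding gap.
Introduces a step-level relative advantage mechanism that resolves the trajectory-level reward conflict arising when translation and reasoning rewards are optimized jointly, enabling separate credit assignment for understanding and reasoning.
Uses standard translation metrics (such as ChrF++ and XCOMET) as a proxy reward that quantifies problem understanding, giving the model an explicit learning signal.
Shows that the method improves not only mathematical reasoning but also translation quality, with good cross-lingual generalization.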

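Method

!!! info

The paper uses reinforcement learning and builds TAPO on the GRPO (Group Relative Policy Optimization) framework. First, it designs a hybrid reward made up of a format reward, a translation reward (using metrics such as ChrF++ and XCOMET), and a reasoning reward (using Math-Verify). Second, a dedicated prompt guides the model to produce an English translation first and then an English reasoning process. The core is the step-level relative advantage mechanism: rewards for the translation segment and the reasoning segment are normalized separately to compute independent advantages, and the translation advantage is blended in through an interpolation coefficient, which keeps the different reward signals from interfering with each other.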
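
A minimal sketch of the step-level idea, under stated assumptions: rewards within a group of rollouts are normalized GRPO-style (subtract the group mean, divide by the group standard deviation), separately for the translation and reasoning segments. The exact fusion rule is not given in the report, so the blend used here (translation tokens receive `alpha * A_trans + (1 - alpha) * A_reason`, reasoning tokens receive `A_reason` only) and the value of `alpha` are hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_level_advantages(translation_rewards, reasoning_rewards, alpha=0.5):
    """Compute separate advantages per segment for each rollout in the group.

    The blend for translation tokens is an assumption made for illustration.
    """
    a_trans = group_normalize(translation_rewards)
    a_reason = group_normalize(reasoning_rewards)
    return [
        {"translation_tokens": alpha * t + (1 - alpha) * r, "reasoning_tokens": r}
        for t, r in zip(a_trans, a_reason)
    ]


if __name__ == "__main__":
    # Four rollouts for one prompt: translation quality in [0, 1], answer correct or not.
    trans = [0.9, 0.5, 0.1, 0.5]
    reason = [1.0, 0.0, 1.0, 0.0]
    for adv in step_level_advantages(trans, reason):
        print(adv)
```

The third rollout reaches the right answer with a poor translation. With a single trajectory-level advantage, its translation tokens would be reinforced along with the rest. With separate advantages, the reasoning tokens get a positive advantage and the translation tokens get a negative one.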

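Key results:

TAPO outperforms baselines such as vanilla GRPO on multilingual mathematical reasoning.
The model's translation quality improves markedly, supporting the idea that translation rewards strengthen understanding.
The step-level relative advantage mechanism avoids reward conflict and keeps the model from producing reasoning trajectories that are “coincidentally correct but based on a wrong understanding”.
The method generalizes well to unseen languages and out-of-domain math benchmarks (such as MMATH and MSVAMP).
The explicit translation step does not significantly increase inference cost.

Tech stack: GRPO (Group Relative Policy Optimization), Reinforcement Learning (RL), Step-level Relative Advantage, Qwen2.5-3B-Instruct, Llama3.2-3B-Instruct, ChrF++, XCOMET-XL, COMETKIWI-DA-XL, Math-Verify, MGSM8KInstruct, GSM8K, MGSM, MMATH, MSVAMP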

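Strengths

Well targeted: correctly identifies the “understanding bottleneck” in multilingual reasoning and proposes a solution aimed at it.
Novel mechanism: the step-level relative advantage mechanism resolves reward conflict in multi-objective reinforcement learning and keeps training stable.
Interpretability: the explicit translation step makes the model's “understanding” observable and quantifiable.
Generalization: works beyond the training languages, extending to unseen languages and other math tasks.

Limitations

Reliance on an English pivot: the method depends heavily on English as the pivot language and may be suboptimal for reasoning between two non-English languages.
Limits of translation metrics: existing translation metrics (such as XCOMET) may perform poorly on low-resource languages or be vulnerable to reward hacking.
Inference overhead: although the paper reports no significant cost increase, explicitly generating a translation still lengthens the output in principle and may affect inference latency.
Data dependence: computing the translation reward requires high-quality parallel data or reference translations, which can be hard to obtain for some low-resource languages.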

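Relevance to research direction:

The paper falls under innovation in large-model technical principles and focuses on LLM reasoning in multilingual settings. It applies reinforcement learning (RL) to LLM fine-tuning, which is an innovation in deep learning technical principles. Although the application scenario is mathematical reasoning, the core contribution is methodological (the TAPO framework and the step-level advantage mechanism), which matches the keyword “innovation in the technical principles of large models and deep learning”. The work is fairly novel.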

📋 All Papers

1. ✅ Closing the Confidence-Faithfulness Gap in Large Language Models

Authors: Miranda Muqing Miao, Lyle Ungar Venue/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25052v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

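The paper finds that the confidence LLMs verbalize is disconnected from their actual accuracy, shows that the reasoning process contaminates the verbalized confidence, and proposes an adaptive steering method that substantially improves calibration alignment.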

Abstract

Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another – a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the “Reasoning Contamination Effect.” Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model’s internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
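
For readers unfamiliar with contrastive activation addition, here is a toy sketch of the general technique, not the authors' pipeline. A steering direction is taken as the difference between the mean activations of two contrasting prompt sets and is added to a hidden state with some strength. The two-dimensional vectors and the helper names are illustrative assumptions, and real activations would come from a model's residual stream. The cosine check shows what "orthogonal directions" means numerically.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """CAA direction: mean activation on positive prompts minus mean on negative ones."""
    mu_pos, mu_neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, direction, strength):
    """Add the scaled direction to a hidden state."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal directions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


if __name__ == "__main__":
    # Toy activations for prompts with high and low verbalized confidence.
    high_conf = [[1.0, 0.0], [3.0, 0.0]]
    low_conf = [[0.0, 0.0], [0.0, 0.0]]
    confidence_dir = steering_vector(high_conf, low_conf)
    calibration_dir = [0.0, 1.5]  # stand-in for a direction found by a linear probe
    print(confidence_dir, cosine(confidence_dir, calibration_dir))
    print(steer([1.0, 1.0], confidence_dir, strength=0.5))
```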

2. ✅ AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer’s Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

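The study proposes AD-CARE, an LLM-based agent framework that generates clinical diagnostic reports for Alzheimer's disease from heterogeneous multimodal data. Across multiple cohorts it reaches 84.9% diagnostic accuracy and improves fairness, and in a reader study it raises clinicians' diagnostic accuracy and reduces decision time.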

Abstract

Alzheimer’s disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

Keywords: Alzheimer’s disease diagnosis, Large language models, LLM agent, Clinical decision support, Multimodal data, Guideline-grounded, Fairness analysis, Reader study

3. ✅ FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

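The paper proposes FinMCP-Bench, a benchmark for evaluating how well LLM agents solve real-world financial problems by invoking financial tools, and systematically evaluates mainstream LLMs on it.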

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples (single tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

Keywords: LLM Agents, Tool Use, Financial Benchmark, Model Context Protocol, Reasoning Evaluation, Real-world Financial Problems, Tool Invocation Accuracy

4. ✅ SEVerA: Verified Synthesis of Self-Evolving Agents

Authors: Debangshu Banerjee, Changming Xu, Gagandeep Singh Venue/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25111v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

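The paper proposes SEVerA, a framework that uses formal verification to ensure the safety and correctness of self-evolving LLM agents. It achieves zero constraint violations while improving performance on program verification, math synthesis, and tool-use tasks.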

Abstract

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($τ^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.
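
The "rejection sampler with a verified fallback" pattern described in the abstract can be sketched in a few lines. This is a simplified illustration, not SEVerA's implementation: `guarded_call` and its parameters are hypothetical names, the contract is an ordinary Python predicate rather than a first-order logic specification, and the runtime `assert` on the fallback merely stands in for the formal proof the paper relies on.

```python
import random


def guarded_call(generate, contract, fallback, max_tries=5):
    """Return an output that satisfies `contract`.

    Samples from `generate` up to `max_tries` times and returns the first
    accepted candidate; otherwise returns the fallback. In the paper the
    fallback is formally verified; here that is approximated by an assert.
    """
    for _ in range(max_tries):
        candidate = generate()
        if contract(candidate):
            return candidate
    result = fallback()
    assert contract(result), "fallback must satisfy the contract"
    return result


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_model():
        # Stand-in for a generative model call that may violate the contract.
        return rng.randint(-5, 5)

    def non_negative(value):
        return value >= 0

    print(guarded_call(noisy_model, non_negative, fallback=lambda: 0))
```

Whatever the underlying model returns, the caller only ever sees values that pass the contract, which is why tuning the model's parameters afterwards cannot break the hard constraint.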

Keywords: Self-Evolving Agents, LLM Agents, Formal Verification, Safety Guarantees, Agentic Code Generation, Tool Use, Constrained Learning, Verified Synthesis

5. ✅ Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

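Scoring rationale: The paper's core topic is training LLMs for multi-step tool orchestration, which is highly relevant to “Large Language Models”, “Tool Use”, and “LLM Agents” (10 each), since it directly studies how LLMs call multiple APIs and manage their dependencies. It is partly related to “Chain of Thought” (8), because multi-step orchestration involves sequential reasoning, although the paper does not explicitly use CoT terminology. Other keywords such as MoE, SFT, and RAG are not mentioned or not relevant and score 0.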

!!! tip deepseek-chat TL;DR

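The paper addresses LLM failures in multi-step tool orchestration caused by parameter value errors and sequence dependencies. By building a reinforcement learning environment backed by real API responses and designing graduated rewards, it substantially improves task execution accuracy on ComplexFuncBench.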

Abstract

Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
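The graduated reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rules, so everything concrete here is an assumption: the `Call` type, the three granularity levels (function name, argument names, argument values), the in-order matching used for orchestration, the equal weights `w_atomic` and `w_orch`, and the tool names in the example.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str   # tool (function) name
    args: dict  # argument name -> value


def atomic_validity(pred: Call, gold: Call) -> float:
    """Partial credit at increasing granularity: name, argument names, argument values."""
    if pred.name != gold.name:
        return 0.0
    score = 1 / 3                          # correct function name
    if set(pred.args) == set(gold.args):
        score += 1 / 3                     # correct argument names
        if pred.args == gold.args:
            score += 1 / 3                 # correct argument values
    return score


def orchestration(pred: list, gold: list) -> float:
    """Fraction of gold tools that appear in the prediction in the gold order."""
    names = [c.name for c in pred]
    pos, hits = 0, 0
    for g in gold:
        if g.name in names[pos:]:
            pos = names.index(g.name, pos) + 1
            hits += 1
    return hits / len(gold) if gold else 1.0


def graduated_reward(pred: list, gold: list, w_atomic: float = 0.5, w_orch: float = 0.5) -> float:
    """Weighted sum of per-call validity and ordering quality (assumed weights)."""
    atomic = sum(atomic_validity(p, g) for p, g in zip(pred, gold)) / max(len(gold), 1)
    return w_atomic * atomic + w_orch * orchestration(pred, gold)


# Hypothetical two-step trace: the second call has a wrong argument value.
gold = [Call("search_flights", {"to": "NYC"}), Call("book_flight", {"flight_id": "F1"})]
pred = [Call("search_flights", {"to": "NYC"}), Call("book_flight", {"flight_id": "F9"})]
reward = graduated_reward(pred, gold)  # partial credit, strictly between 0 and 1
```

A binary reward would score this trace 0. The graduated version still credits the correct tool order and the correct function and argument names, which is the kind of partial-correctness signal the abstract says binary rewards lack.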

Keywords: Large Language Models, Multi-step Tool Orchestration, API Tool Use, Reinforcement Learning, Graduated Rewards, Data Synthesis, ComplexFuncBench, Turn Accuracy

6. ✅ An Experimental Comparison of the Most Popular Approaches to Fake News Detection

Authors: Pietro Dell’Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25501v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
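
The total score reported for this paper can be reproduced from the table. This is a sketch under two assumptions: that each row's score is weight × relevance (inferred from the table columns), and that the passing score follows the formula given in the report's configuration section, 5.0 + 0.8 × total weight, with a total weight of 27.0. The shortened keyword labels are for illustration only.

```python
# (weight, relevance) for the non-zero rows of the table above; all other rows contribute 0.
rows = {
    "Large Language Models": (1.0, 10.0),
    "Pre-training": (1.0, 5.0),
    "Post-training": (1.0, 5.0),
    "Hallucination Mitigation": (1.0, 10.0),
    "In-context Learning": (1.0, 5.0),
}

TOTAL_WEIGHT = 27.0  # 27 keywords with weight 1.0 each, per the configuration section


def total_score(rows: dict) -> float:
    """Sum of weight * relevance over all keyword rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows.values())


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


score = total_score(rows)
threshold = passing_score(TOTAL_WEIGHT)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under the same rule, the papers scored 26.0 and 25.0 later in the report fall just short of the 26.6 threshold.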

!!! tip deepseek-chat TL;DR

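The paper systematically evaluates 12 fake news detection approaches on 10 datasets, each treated as a distinct domain. It finds that fine-tuned models perform well in-domain but generalize poorly, while LLMs offer a promising alternative through zero- and few-shot learning.

Abstract (Chinese translation)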

近年来，虚假新闻检测在公共讨论与科学研究中日益受到关注。尽管检测技术不断进步，但在大语言模型（LLMs）的驱动和社交媒体的放大效应下，虚假信息的产生与传播也变得更加复杂。本文对12种具有代表性的虚假新闻检测方法进行了批判性评估，涵盖传统机器学习、深度学习、Transformer模型以及专门的跨领域架构。我们在10个公开可用的数据集上评估了这些方法，这些数据集在体裁、来源、主题和标注逻辑上各不相同。我们将纯文本英文虚假新闻检测视为二分类任务，通过统一标注为“真实”与“虚假”来确保一致的评估流程。我们承认不同数据集的标签语义存在差异，统一标注不可避免地消除了此类语义上的细微差别。每个数据集均被视为一个独立领域。我们进行了领域内、多领域和跨领域实验，以模拟涉及领域偏移和分布外数据的真实场景。微调模型在领域内表现良好，但泛化能力不足。跨领域架构可以缩小这一差距，但需要大量数据；而大语言模型通过零样本和少样本学习提供了有前景的替代方案。鉴于数据集固有的混杂因素以及可能的预训练数据暴露，实验结果应视为在此纯文本英文评估框架内的鲁棒性评估。

Abstract

In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into “Real” and “Fake” to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.

Keywords: fake news detection, large language models, domain adaptation, zero-shot learning, few-shot learning, binary classification, cross-domain evaluation, robustness evaluation

7. ✅ TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25419v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

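The paper targets the performance gap of LLMs in multilingual mathematical reasoning, which it attributes to deficiencies in language understanding. It proposes TAPO (Translation-Augmented Policy Optimization), a reinforcement learning framework that improves performance on both multilingual mathematical reasoning and translation tasks and generalizes to unseen languages and out-of-domain tasks.

Abstract (Chinese translation)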

大型语言模型（LLMs）在英语数学推理方面已展现出卓越能力，但在多语言场景中仍存在显著的性能差距，这主要归因于语言理解能力的不足。为弥合这一差距，我们提出了翻译增强策略优化（Translation-Augmented Policy Optimization, TAPO），这是一个基于GRPO构建的新型强化学习框架。TAPO采用显式的对齐策略，使模型以英语为枢轴语言，并遵循“先理解后推理”的范式。关键之处在于，我们引入了步骤级相对优势机制，将理解与推理过程解耦，从而能够在避免优化冲突的前提下整合翻译质量奖励。大量实验表明，TAPO能有效协同语言理解与推理能力，并兼容多种模型。该方法在多语言数学推理和翻译任务中均优于基线方法，同时对新语言和领域外任务展现出良好的泛化能力。

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
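
The idea of decoupling understanding from reasoning on top of GRPO can be sketched as follows. GRPO standardizes each sampled response's reward within its group. The step-level split shown here is an assumption about how the mechanism might work, not the paper's actual formulation: each response is treated as a translation segment followed by a reasoning segment, and each segment's tokens receive only the advantage computed from that segment's own reward. The population standard deviation is used for simplicity; implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative(rewards: list, eps: float = 1e-8) -> list:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_level_advantages(translation_rewards: list, reasoning_rewards: list) -> list:
    """Return one (understanding_advantage, reasoning_advantage) pair per sampled response.

    Assumption: the two reward streams are normalized separately, so a poor
    translation does not drag down the credit given to correct reasoning.
    """
    a_translation = group_relative(translation_rewards)
    a_reasoning = group_relative(reasoning_rewards)
    return list(zip(a_translation, a_reasoning))


# Hypothetical group of four sampled responses to one non-English problem.
translation_quality = [0.9, 0.5, 0.1, 0.5]  # e.g. a translation-quality score in [0, 1]
answer_correct = [1.0, 0.0, 1.0, 0.0]       # 1.0 if the final answer is right
advantages = step_level_advantages(translation_quality, answer_correct)
```

In this toy group, the third response has the worst translation but a correct answer, so its translation tokens get a negative advantage while its reasoning tokens get a positive one. Summing the two rewards into a single scalar would blur that distinction.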

8. ❌ Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

Authors: Liang Zhang, Yu Fu, Xinyi Jin Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25633v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

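Scoring rationale: The paper's core topic is the use of large language models (LLMs) in math education, specifically their dual role as problem solvers and assessors, so it is highly relevant to “Large Language Models” (10 points). It involves multi-step reasoning and error localization in mathematical reasoning, which relates to “Chain of Thought” (8 points). It evaluates LLM-based math tutor agents, which falls under “LLM Agents” (8 points). Other keywords, such as MoE, SLMs, training methods, optimization techniques, inference acceleration, and AI-for-science applications, are not mentioned in the abstract and score 0.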

!!! tip deepseek-chat TL;DR

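The study examines the relationship between LLMs' math problem-solving ability and their step-level assessment performance. Models assess more accurately on problems they solved correctly, but assessment remains harder than direct problem solving and requires additional capabilities such as step tracking and error localization.

Abstract (Chinese translation)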

大型语言模型（LLM）在数学教育中的应用日益广泛，不仅作为问题求解工具，也作为学习者推理过程的评估者。然而，更强的数学问题求解能力是否与更强的步骤级评估性能相关联，目前尚不明确。本研究利用PROCESSBENCH（一个用于识别数学推理中最早错误步骤的人工标注基准）中的GSM8K和MATH子集，探讨了这一关系。我们评估了两种基于LLM的数学辅导智能体设置（分别以GPT-4和GPT-5实例化），在相同数学问题上的两项独立任务：求解原始问题，以及通过预测最早错误步骤来评估基准提供的解答。结果显示出一致的模型内规律：同一模型在自身能正确求解的数学问题条目上，其评估准确率显著高于在自身求解错误的条目上，这一关联在两种模型和数据集上均具有统计显著性。同时，评估任务仍比直接问题求解更为困难，尤其是在存在错误的解答上。这些发现表明，数学问题求解的专业能力有助于提升评估表现，但可靠的步骤级诊断还需要额外的能力，如步骤追踪、过程监控和精确错误定位。研究结果对数学教育中用于形成性评估的人工智能支持自适应教学系统（AIS）的设计与评估具有启示意义。

Abstract

Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners’ reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
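
The within-model comparison described in the abstract amounts to splitting items by whether the model solved the original problem and computing assessment accuracy in each group. A minimal sketch, assuming a hypothetical `Item` record and the convention that step index -1 means "no erroneous step":

```python
from dataclasses import dataclass


@dataclass
class Item:
    solved_correctly: bool  # did the model solve the original problem?
    predicted_step: int     # model's predicted earliest erroneous step (-1 = no error)
    gold_step: int          # human-annotated earliest erroneous step (-1 = no error)


def assessment_accuracy(items: list) -> float:
    """Share of items where the predicted earliest error step matches the annotation."""
    if not items:
        return float("nan")
    return sum(i.predicted_step == i.gold_step for i in items) / len(items)


def accuracy_by_solving_outcome(items: list) -> tuple:
    """Assessment accuracy on items the model solved vs. items it failed to solve."""
    solved = [i for i in items if i.solved_correctly]
    failed = [i for i in items if not i.solved_correctly]
    return assessment_accuracy(solved), assessment_accuracy(failed)
```

The four counts behind these two accuracies form a 2×2 contingency table, which is the natural input for a test of association. The abstract reports statistically significant associations but does not say which test was used.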

Keywords: Large Language Models, mathematical problem-solving, assessment performance, step-level diagnosis, math education, AI tutor agents, reasoning errors, Adaptive Instructional Systems

9. ❌ Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Authors: Aleix Sant, Jordi Luque, Carlos Escolano Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24242v2

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

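Scoring rationale: The paper's core topic is multilingual instruction tuning of large language models (LLMs) in a federated learning (FL) setting, so it is highly relevant to “Large Language Models” (10 points). It explicitly involves “Post-training/Supervised Fine-tuning” and “Instruction Tuning”, because its experiments are built on multilingual instruction tuning; since these are not the paper's sole focus, each scores 8 points. Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, PEFT, and RAG, are neither directly addressed nor mentioned and score 0.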

!!! tip deepseek-chat TL;DR

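The study examines how client language composition, from monolingual to multilingual, affects the quality, fairness, and training cost of multilingual LLMs under federated learning. Increasing within-client multilinguality yields stronger and fairer global models, with the largest gains for lower-resource languages, at the cost of more optimization steps.

Abstract (Chinese translation)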

在多语言环境下进行大语言模型的联邦学习面临着显著挑战，这些挑战主要源于客户端间异构的语言分布以及语言资源可用性的差异。为解决这些问题，我们扩展了联邦学习框架FederatedScope-LLM，以支持大语言模型的多语言指令微调实验。同时，我们提出了一种新颖的客户端特定早停机制——本地动态早停（Local Dynamic Early Stopping, LDES-FL），该机制允许客户端根据本地验证性能暂停和恢复本地训练，从而提升训练效率和可持续性。通过一系列实验，我们研究了客户端的语言构成——从完全单语到逐渐多语化的客户端——如何影响多语言质量、公平性和训练成本。对于单一语言的专业化任务，本地单语微调仍然最为有效，而联邦训练则更适合学习单一平衡的多语言模型。在联邦学习中，增加客户端内部的多语言性能够产生更强且更公平的全局模型，缩小与集中式多语言微调的差距，并为资源较少的语言带来最大的收益，尽管这会以更多的优化步骤为代价。总体而言，我们的研究结果表明，客户端的语言构成是多语言联邦学习中的一个关键设计变量，它影响着性能、公平性和效率。

Abstract

Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.
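
The pause-and-resume behavior of a client-specific early stopping rule can be sketched as a small state machine. The abstract only says that clients pause and resume local training based on client-side validation performance; the concrete rule below is an assumption, not the LDES-FL algorithm itself. Here a client pauses after `patience` rounds without improvement and resumes once the incoming global model's loss on the local validation set drifts above the best seen value by more than `tolerance`.

```python
class EarlyStoppingClient:
    """Hypothetical per-client controller deciding whether to train in the next round."""

    def __init__(self, patience: int = 2, tolerance: float = 0.05):
        self.patience = patience
        self.tolerance = tolerance
        self.best_loss = float("inf")
        self.stale_rounds = 0
        self.active = True

    def update(self, val_loss: float) -> bool:
        """Record this round's local validation loss; return True if the client should train."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1

        if self.active and self.stale_rounds >= self.patience:
            self.active = False   # pause: skip local training, keep evaluating
        elif not self.active and val_loss > self.best_loss + self.tolerance:
            self.active = True    # resume: the global model drifted away from local data
            self.stale_rounds = 0
        return self.active


# Illustrative validation losses over seven federated rounds.
client = EarlyStoppingClient(patience=2, tolerance=0.05)
schedule = [client.update(loss) for loss in [1.0, 0.8, 0.8, 0.81, 0.82, 0.9, 0.7]]
```

Paused clients skip local optimization steps, which is one plausible way such a mechanism could reduce training cost. Unlike conventional early stopping, the decision is reversible and made independently per client.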

Keywords: Federated Learning, Large Language Models, Multilingual, Instruction-tuning, Client Language Composition, Fine-tuning, Training Efficiency, Fairness

10. ❌ Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

Authors: Chengshuai Yang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25636v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

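Scoring rationale: The paper studies how autonomous agents (Plan, Judge, Execute) automatically translate a natural-language description into a validated computational imaging system design. Its core contribution is applying an agent system to a scientific problem, imaging system design. It is therefore highly relevant to “LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow” (10 points), because it explicitly introduces three autonomous agents. It is relevant to “Multi-agent Systems” OR “Agent Coordination” (8 points), because multiple agents must coordinate. It is relevant to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (8 points), because it applies AI to computational imaging system design in a scientific setting. The remaining keywords mainly concern the technical foundations of large models (such as LLM training, optimization, and inference), which the paper does not directly address, so they score 0.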

!!! tip deepseek-chat TL;DR

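The paper proposes a method that uses autonomous agents to translate a natural-language description into a validated computational imaging system design. It matches expert-library quality on 6 real-data modalities and demonstrates compositional reach beyond any single-modality tool.

Abstract (Chinese translation)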

设计一套计算成像系统——包括选择算子、设定参数、验证一致性——针对每种成像模态通常需要耗费专家数周时间，这种专业门槛限制了更广泛的科学界对成像仪器进行原型设计的能力。我们提出了spec.md这一结构化规范格式，以及三个自主智能体——规划（Plan）、评估（Judge）与执行（Execute）——它们能够将一句自然语言描述转化为一个经过验证、且重建误差有界的正向模型。一项从设计到实物的误差定理将总重建误差分解为五个可独立限定的项，每一项均对应一种纠正措施。在涵盖全部五种载体家族的6种真实数据模态上，该自动化流程达到了专家库的质量水平（98.1 +/- 4.2%）。十项新颖的设计——将基础模块组合成从3D到5D的链式结构——展示了其组合能力已超越任何单一模态工具。

Abstract

Designing a computational imaging system – selecting operators, setting parameters, validating consistency – requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents – Plan, Judge, and Execute – that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs – composing primitives into chains from 3D to 5D – demonstrate compositional reach beyond any single-modality tool.

Keywords: computational imaging system, autonomous agents, natural language description, forward model, reconstruction error, spec.md, agent coordination, AI for science

11. ❌ SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning

Authors: Xinyu Wang, Fei Dou, Jinbo Bi, Minghu Song Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25062v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

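Scoring rationale: The paper focuses on chemical language models (CLMs), an application of large models to science (specifically cheminformatics), so it is highly relevant to “AI for Science” (10 points). It proposes SIGMA, a pre-training alignment method based on contrastive learning, which falls under “Pre-training” (8 points). Although the paper does not explicitly use the term “Large Language Models”, chemical language models are essentially domain-specific large language models, so this keyword scores 8 points. Other keywords, such as MoE, SFT, and RLHF, are unrelated to the paper's core content of molecular generation and structural alignment and score 0.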

!!! tip deepseek-chat TL;DR

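The paper addresses the ambiguity that linearized molecular representations introduce in chemical language models, where one molecular graph maps to multiple sequences. It proposes SIGMA (Structure-Invariant Generative Molecular Alignment), which combines autoregressive contrastive learning with Isomorphic Beam Search (IsoBeam) to improve sample efficiency and structural diversity in molecular generation.

Abstract (Chinese translation)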

线性化字符串表示是可扩展自回归分子生成的基础；然而，其引入了根本性的模态不匹配问题：单个分子图可映射至多个不同的序列。这种模糊性导致了轨迹发散现象，即结构等效的部分子图因其线性化历史不同，其潜在表征在隐空间中逐渐漂移分离。为在不放弃高效字符串框架的前提下解决此问题，我们提出了结构不变生成分子对齐方法（SIGMA）。SIGMA不改变线性表示本身，而是通过词元级对比学习目标，使模型能够严格识别几何对称性，显式地对齐具有相同后缀的前缀序列的潜在状态。此外，我们引入同构束搜索（IsoBeam），通过在推理过程中动态剪枝等价路径来消除同构冗余。在标准基准上的实证评估表明，SIGMA弥合了序列可扩展性与图保真度之间的差距，相较于强基线模型，在多参数优化任务中实现了更优的采样效率和结构多样性。

Abstract

Linearized string representations serve as the foundation of scalable autoregressive molecular generation; however, they introduce a fundamental modality mismatch where a single molecular graph maps to multiple distinct sequences. This ambiguity leads to trajectory divergence, where the latent representations of structurally equivalent partial graphs drift apart due to differences in linearization history. To resolve this without abandoning the efficient string formulation, we propose Structure-Invariant Generative Molecular Alignment (SIGMA). Rather than altering the linear representation, SIGMA enables the model to strictly recognize geometric symmetries via a token-level contrastive objective, which explicitly aligns the latent states of prefixes that share identical suffixes. Furthermore, we introduce Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths. Empirical evaluations on standard benchmarks demonstrate that SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.
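
A contrastive objective that pulls together the latent states of equivalent prefixes can be illustrated with a generic InfoNCE loss. This is a standard formulation chosen for illustration and is not claimed to be SIGMA's exact objective. The assumption is that `anchor` and `positive` are hidden-state vectors of two different linearizations of the same partial molecular graph, and `negatives` are hidden states of unrelated prefixes.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: list, positive: list, negatives: list, temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor: negative log-softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D hidden states: an aligned pair gives a small loss, a diverged pair a large one.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
diverged = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss drives equivalent prefixes toward the same region of latent space. That directly counteracts the drift the abstract calls trajectory divergence.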

Keywords: Chemical Language Models, Molecular Generation, Autoregressive Contrastive Learning, Structure-Invariant Alignment, Linearized String Representations, Trajectory Divergence, Isomorphic Beam Search, Multi-parameter Optimization

12. ❌ Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Authors: Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25740v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

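Scoring rationale: The paper proposes Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework whose core goal is to align an autonomous driving system with a user's long-term driving habits and real-time instructions. It is therefore highly relevant to “Alignment” (10 points). It learns user embeddings and conditions the policy on them, which can be viewed as a form of domain adaptation and fine-tuning, so it is moderately relevant to “Pre-training/Domain Adaptation” and “Post-training/SFT” (5 points each). DMW is a VLA model rather than a pure LLM, but it applies large models to a specific domain (autonomous driving), so it is moderately relevant to “Large Language Models/Foundation Models” (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, inference acceleration, and quantization, are either not explicitly addressed or not central to the paper and score 0.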

!!! tip deepseek-chat TL;DR

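The paper addresses the lack of personalization in existing autonomous driving systems by proposing Drive My Way, a personalized Vision-Language-Action (VLA) driving framework. Using user embeddings and natural-language instructions, it adapts to users' long-term habits and short-term intentions; closed-loop evaluation on the Bench2Drive benchmark and user studies support its effectiveness and the recognizability of each driver's style.

Abstract (Chinese translation)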

人类驾驶行为具有固有的个性化特征，其由长期习惯塑造并受短期意图影响。不同个体在加速、制动、并线、让行与超车等多样化场景中均表现出差异性。然而，现有的端到端自动驾驶系统或针对通用目标进行优化，或依赖固定的驾驶模式，缺乏适应个体偏好或解析自然语言意图的能力。为弥补这一不足，我们提出了“Drive My Way”（DMW），一种个性化的视觉-语言-行动（Vision-Language-Action, VLA）驾驶框架，该框架能与用户的长期驾驶习惯保持一致，并适应其实时指令。DMW从我们收集的多位真实驾驶员、多场景下的个性化驾驶数据集中学习用户嵌入向量，并在规划过程中以此嵌入向量为策略提供条件，而自然语言指令则提供额外的短期引导。在Bench2Drive基准测试上的闭环评估表明，DMW提升了风格指令适应能力；用户研究显示，其生成的行为可被识别为对应驾驶员自身的风格，这凸显了个性化作为以人为本的自动驾驶的一项关键能力。我们的数据与代码公开于 https://dmw-cvpr.github.io/。

Abstract

Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users’ long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver’s own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.

Keywords: Personalized Driving, Vision-Language-Action Model, Preference Alignment, User Embedding, Natural Language Instructions, Autonomous Driving, Human-centered AI, Closed-loop Evaluation

13. ❌ Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Authors: Giuseppe Samo, Paola Merlo Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25227v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

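Scoring rationale: The paper centers on training and evaluating large language models (LLMs), so it matches the “Large Language Models” keyword directly (10 points). Its comparison of natural and synthetic data for LLMs relates indirectly to data quality (“Scaling Laws” AND “Data Quality”: 5 points) and to training methods (“Pre-training” and “Post-training”: 5 points each). Other keywords, such as MoE, SLMs, alignment, and inference acceleration, are not mentioned in the abstract and score 0.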
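The total shown for this paper (25.0 / 26.6) can be reproduced with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight) and the 15-point cap per keyword come from the report's configuration section. The per-keyword rule (points = weight × relevance) and the `>=` pass comparison are assumptions inferred from the table rows, and the names `KeywordScore`, `keyword_points`, `pass_threshold`, and `evaluate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeywordScore:
    keyword: str
    weight: float
    relevance: float  # 0-10 scale, as in the relevance column


def keyword_points(row: KeywordScore, max_points: float = 15.0) -> float:
    # Assumption: points = weight x relevance, capped at the configured maximum.
    return min(max_points, row.weight * row.relevance)


def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    # Pass-mark formula from the report configuration: 5.0 + 0.8 x total weight.
    return base + factor * total_weight


def evaluate(rows, total_weight):
    total = sum(keyword_points(r) for r in rows)
    threshold = pass_threshold(total_weight)
    # Assumption: a paper passes when its total reaches the threshold.
    return total, threshold, total >= threshold


# Only the non-zero rows of the table above; zero rows do not change the sum.
rows = [
    KeywordScore("Large Language Models", 1.0, 10.0),
    KeywordScore("Scaling Laws AND Data Quality", 1.0, 5.0),
    KeywordScore("Pre-training", 1.0, 5.0),
    KeywordScore("Post-training", 1.0, 5.0),
]
total, threshold, passed = evaluate(rows, total_weight=27.0)
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0, the threshold is 26.6, so this paper falls 1.6 points short.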

!!! tip deepseek-chat TL;DR

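The study compares how natural and synthetic data affect the training and evaluation of large language models (LLMs) on the passive verb alternation in French and Italian. Models trained on natural data capture abstract linguistic patterns more robustly, whereas models trained only on synthetic data do not reliably generalize to natural sentences.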

摘要翻译

本研究以法语和意大利语中的被动动词交替现象为例，比较了自然数据与合成数据对大型语言模型（Large Language Models, LLMs）训练与评估的影响。我们采用黑鸟语言矩阵（Blackbird Language Matrices, BLMs）——一种旨在通过系列句子探究底层语言知识模式的结构化数据集。我们将从通用依存语料库（Universal Dependencies）中提取的自然句子所实例化的结构化模板，与合成句子的结构化模板进行对比。实验表明，尽管模型在合成数据集上进行训练和测试时能达到上限性能，但它们无法可靠地推广到自然句子。相反，使用自然数据训练的模型在自然与合成测试集中均表现出稳健的性能，这证明其在捕捉抽象语言模式方面具有更优的能力。这些结果印证了自然数据以及结构化评估设置在探究LLMs句法与语义知识方面的价值。

摘要 (Abstract)

This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured set ups in linguistic evaluation for probing LLMs’ syntactic and semantic knowledge.

Keywords: large language models, natural data, synthetic data, training, evaluation, linguistic knowledge, generalization, French and Italian

14. ❌ Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Authors: Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25333v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

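Scoring rationale: The paper's core contribution is an adaptive chunking framework for RAG systems, so it is highly relevant to “Retrieval-Augmented Generation” (10 points). It introduces an LLM-regex splitter, which indicates use of LLM technology and gives it some relevance to “Large Language Models” (5 points). It does not cover the technical substance of the other keywords, such as MoE, SLMs, scaling laws, training methods, inference optimization, or agent systems, so those keywords score 0.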

!!! tip deepseek-chat TL;DR

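The paper addresses the lack of an evaluation framework for document chunking strategies in retrieval-augmented generation (RAG) and proposes an adaptive chunking framework built on five intrinsic metrics. In the reported experiments, the framework improves downstream RAG performance, raising answer correctness from 62-64% to 72%.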

摘要翻译

检索增强生成（RAG）的有效性高度依赖于文档分块（chunking）的方式，即如何将文档分割为更小的单元以进行索引与检索。然而，常用的“一刀切”方法往往难以捕捉多样化文本的细微结构及语义。尽管分块处于核心地位，但目前缺乏专门的评估框架，导致难以独立于下游任务性能来评估和比较不同分块策略。我们通过提出自适应分块（Adaptive Chunking）框架来挑战这一范式，该框架基于五个新颖的、面向文档的内在指标——引用完整性（RC）、块内内聚性（ICC）、文档上下文连贯性（DCC）、块完整性（BI）与尺寸合规性（SC）——为每份文档选择最合适的分块策略，这些指标可直接从关键维度评估分块质量。为支持此框架，我们还引入了两种新的分块器：一种基于大语言模型与正则表达式的分割器（LLM-regex splitter），以及一种先分割后合并的递归分割器（split-then-merge recursive splitter），并辅以针对性的后处理技术。在涵盖法律、技术及社会科学领域的多样化语料上，我们基于指标引导的自适应方法显著提升了下游RAG性能。在不改变模型或提示的情况下，该框架改善了RAG结果，将答案正确率提升至72%（原为62-64%），并使成功回答的问题数量增加超过30%（65 vs. 49）。这些结果表明，通过一套互补的内在指标引导、具备文档感知的自适应分块，为构建更稳健的RAG系统提供了一条切实有效的路径。代码发布于 https://github.com/ekimetrics/adaptive-chunking。

摘要 (Abstract)

The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used “one-size-fits-all” approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
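The per-document selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the five metrics are combined, so the unweighted mean used here is an assumption, as are the function name `select_chunker`, the chunker labels, and every score value; none of this is taken from the authors' code.

```python
from statistics import mean

# Metric abbreviations as listed in the abstract.
METRICS = ("RC", "ICC", "DCC", "BI", "SC")


def select_chunker(scores_by_chunker):
    """Return the chunker whose mean intrinsic-metric score is highest.

    scores_by_chunker maps a chunker name to a dict of metric scores in [0, 1].
    Assumption: metrics are averaged with equal weight.
    """
    def aggregate(scores):
        return mean(scores[m] for m in METRICS)

    return max(scores_by_chunker, key=lambda name: aggregate(scores_by_chunker[name]))


# Hypothetical metric scores for a single document.
doc_scores = {
    "llm_regex": {"RC": 0.9, "ICC": 0.8, "DCC": 0.7, "BI": 0.9, "SC": 0.6},
    "split_then_merge": {"RC": 0.7, "ICC": 0.7, "DCC": 0.6, "BI": 0.8, "SC": 0.9},
    "fixed_size": {"RC": 0.5, "ICC": 0.6, "DCC": 0.5, "BI": 0.4, "SC": 1.0},
}
best = select_chunker(doc_scores)
print(best)
```

Running the selection once per document is what makes the approach adaptive: different documents can end up with different chunkers even though the metric suite stays fixed.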

Keywords: Retrieval-Augmented Generation, RAG, Adaptive Chunking, Document Chunking, Intrinsic Metrics, LLM-regex Splitter, Downstream Performance

15. ❌ GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Authors: Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu, Wufan Zhao Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25565v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

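Scoring rationale: The paper studies Large Multimodal Models (LMMs) for remote sensing, an AI for Science application (Earth science / remote sensing), so it is highly relevant to “AI for Science” (10 points). Its use of LMMs gives it some relevance to “Large Language Models” (5 points), although the core is vision-language multimodal modeling rather than pure language modeling. The remaining keywords concern large-model internals (such as MoE, training methods, and inference optimization) or specific application areas (such as bioinformatics) that the paper does not cover, so they score 0.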

!!! tip deepseek-chat TL;DR

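The paper addresses the neglect of the vertical dimension in current remote sensing large models. It proposes GeoHeight-Bench, an evaluation framework for height-aware remote sensing understanding, and GeoHeightChat, which the authors describe as the first height-aware remote sensing LMM baseline. The baseline indicates that combining visual semantics with height geometric features improves reasoning over complex terrain.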

摘要翻译

当前地球观测领域的大型多模态模型通常忽视了关键的“垂直”维度，这限制了其在复杂遥感几何结构与灾害场景中的推理能力——在这些场景中，物理空间结构的重要性往往超越平面视觉纹理。为弥补这一空白，我们引入了一个专用于高度感知遥感理解的综合评估框架。首先，为克服标注数据严重匮乏的问题，我们开发了一套可扩展的、基于视觉语言模型的自动化数据生成流程，该流程利用系统性提示工程与元数据提取技术，构建了两个互补的基准数据集：用于相对高度分析的GeoHeight-Bench，以及更具挑战性的、面向整体地形感知推理的GeoHeight-Bench+。此外，为验证高度感知的必要性，我们提出了首个高度感知遥感大型多模态模型基线——GeoHeightChat。作为一项有力的概念验证，该基线模型表明：将视觉语义与隐式注入的高度几何特征相融合，能有效缓解现有模型的“垂直盲区”，成功在现有光学模型中开创了交互式高度推理的新范式。

摘要 (Abstract)

Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical “vertical” dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the “vertical blind spot”, successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

Keywords: Large Multimodal Models, Remote Sensing, Height-aware Reasoning, Evaluation Benchmark, Data Generation Pipeline, GeoHeight-Bench, GeoHeightChat, Vertical Dimension

16. ❌ CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

Authors: Shaojin Bai, Yuting Su, Weizhi Nie Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25202v1

Score: 13.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

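Scoring rationale: The paper “CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging” focuses on domain generalization in medical imaging and proposes a causal framework based on conditional instrumental variables to handle selection bias and structural confounding. Its core is deep learning applied to medical image analysis, which falls under “AI for Science” (bioinformatics / medical image analysis in particular), so that keyword scores 8. “Domain Adaptation” is related background, but the paper studies Domain Generalization, which is connected yet has a different emphasis, so that keyword scores 5. The paper does not involve large language models (LLMs), model architectures (such as MoE), training techniques (such as RLHF or PEFT), inference optimization, agents, or other large-model techniques, so all other keywords score 0.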

!!! tip deepseek-chat TL;DR

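The paper targets selection bias driven by patient demographics and site-specific artifacts in medical imaging. It proposes CIV-DG, a causal framework based on conditional instrumental variables and implemented with a DeepGMM architecture. The authors report that it significantly outperforms leading baselines on Camelyon17 and Chest X-Ray datasets, which supports the use of conditional causal mechanisms for more robust medical AI.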

摘要翻译

医学人工智能的跨站点泛化能力从根本上受到选择偏倚的制约——这是一种结构性机制，即患者人口统计学特征（如年龄、疾病严重程度）非随机地决定其医院分配。传统的领域泛化范式主要针对图像层面的分布偏移，未能解决站点特异性变异与诊断标签之间由此产生的伪相关性。为克服这一可识别性障碍，我们提出CIV-DG因果框架，该框架利用条件工具变量将病理语义与扫描仪伪影进行解耦。通过放宽标准工具变量方法对随机分配的严格假设，CIV-DG能够适应医院选择由患者人口统计学特征内生驱动的复杂临床场景。我们通过深度广义矩估计架构实例化该理论，采用条件评判器来最小化矩违背，并在人口统计学分层内强制实现工具变量与误差的正交性。在Camelyon17基准和大规模胸部X射线数据集上的大量实验表明，CIV-DG显著优于主流基线方法，验证了条件因果机制在解决结构混杂问题、构建鲁棒医学人工智能方面的有效性。

摘要 (Abstract)

Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.

Keywords: Domain Generalization, Medical Imaging, Conditional Instrumental Variables, Causal Framework, Selection Bias, DeepGMM, Structural Confounding, Robust AI

17. ❌ Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

Authors: Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25146v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

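Scoring rationale: The paper explicitly discusses LLMs as the core technology behind AI-assisted code generation tools, so it is highly relevant to “Large Language Models” (10 points). It studies the factors that influence the quality of AI-generated code, which is application research on large models in software engineering and fits the research scope of “applications of large models in different domains”. However, it does not examine LLM internals (such as MoE, scaling laws, or training methods) or any other specific technical keyword (such as inference acceleration or hallucination mitigation), so all other keywords score 0.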

!!! tip deepseek-chat TL;DR

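Through a systematic literature review, the study synthesizes the key factors that influence the quality of AI-generated code. It finds that code quality is shaped jointly by human factors, AI system characteristics, and human-AI interaction dynamics, with prompt design, task specification, and developer expertise as the main influencing factors.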

摘要翻译

研究背景：以大型语言模型为代表的人工智能辅助代码生成工具的快速普及，正在深刻改变软件开发实践。尽管这类工具有望显著提升开发效率，但学术界与工业界日益关注其生成代码在质量、可靠性和安全性方面的问题。研究目的：本研究旨在系统梳理现有关于人工智能生成源代码质量影响因素的实证证据，并分析这些因素在不同评估情境下如何影响软件质量结果。研究方法：我们遵循既定规范开展了系统性文献综述，并在人工监督下采用人工智能辅助工作流程予以支持。通过跨主要数字图书馆的结构化检索与筛选，最终纳入24项核心研究。采用基于模式的定性证据综合方法进行数据提取与分析。研究结果：研究发现，人工智能辅助开发中的代码质量受到人为因素、人工智能系统特性以及人机交互动态的共同影响。关键影响因素包括提示词设计、任务规范性和开发者专业水平。结果同时表明，不同研究在代码正确性、安全性、可维护性和复杂性等质量维度上存在显著差异，既观察到质量提升也识别出潜在风险。研究结论：人工智能辅助代码生成标志着软件工程领域的社会技术范式转变，实现高质量产出既依赖于技术因素也取决于人为因素。尽管前景广阔，人工智能生成的代码仍需经过审慎验证并妥善集成至开发工作流程中。

摘要 (Abstract)

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

Keywords: AI-generated code, code quality, large language models, software engineering, systematic literature review, prompt design, developer expertise, human-AI interaction

18. ❌ Learning domain-invariant features through channel-level sparsification for Out-Of Distribution Generalization

Authors: Haoran Pei, Yuguang Yang, Kexin Liu, Juan Zhang, Baochang Zhang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25083v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

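Scoring rationale: The paper addresses out-of-distribution generalization in computer vision and proposes a channel-level sparsification method, Hierarchical Causal Dropout. It has no direct connection to most keywords, especially the LLM-related ones. Only two keywords are moderately relevant. (1) “Mixture of Experts” OR “MoE” OR “Sparse Models”: the paper sparsifies features with channel-level causal masks, which relates to the idea of sparse models but is not an MoE architecture (5 points). (2) “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”: the paper tackles domain generalization, which overlaps conceptually with domain adaptation, but it does not involve pre-training (5 points). All other keywords are not covered and score 0.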

!!! tip deepseek-chat TL;DR

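The paper addresses the drop in out-of-distribution generalization that occurs when deep learning models for image analysis latch onto domain-specific features. It proposes Hierarchical Causal Dropout, which uses channel-level causal masks to enforce feature sparsity and separate causal features from spurious ones, and reports better results than existing methods on OOD benchmarks.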

摘要翻译

分布外泛化已成为评估图像分析系统的一项核心指标。由于深度学习模型倾向于捕捉领域特定的上下文，它们常对这些非因果特征形成捷径依赖，导致在不同数据源上的性能表现不一致。现有技术如不变性学习试图缓解这一问题，但难以在深层隐空间内有效分离高度混杂的特征，这一局限使其无法完全解决捷径学习问题。本文提出分层因果丢弃法，该方法通过通道级因果掩码强制实现特征稀疏性，使模型能够将因果特征与伪相关特征分离，从而在表征层面有效执行因果干预。训练过程以矩阵化互信息为目标函数进行引导，旨在最小化隐特征与领域标签间的互信息，同时最大化其与类别标签的共享信息。为确保稳定性，我们引入基于风格混合的VICReg模块，防止掩码意外过滤关键因果数据。在分布外泛化基准测试上的实验结果表明，本方法优于现有顶尖技术。

摘要 (Abstract)

Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.
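What "channel-level" masking means can be shown on a toy feature map. This sketch illustrates only the masking operation: in HCD the masks are learned under the MMI objective, whereas here the mask is fixed by hand. The function names `apply_channel_mask` and `sparsity` and all values are hypothetical.

```python
def apply_channel_mask(features, mask):
    """Zero out whole channels of a feature map.

    features: list of C channels, each an H x W nested list of floats.
    mask: list of C entries, 1 to keep a channel and 0 to drop it.
    """
    if len(features) != len(mask):
        raise ValueError("need exactly one mask entry per channel")
    return [
        [[value * keep for value in row] for row in channel]
        for channel, keep in zip(features, mask)
    ]


def sparsity(mask):
    """Fraction of channels that the mask drops."""
    return 1 - sum(mask) / len(mask)


# Toy feature map with 3 channels of size 2 x 2 (hypothetical values).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
mask = [1, 0, 1]  # hand-picked here; a learned mask would replace this
masked = apply_channel_mask(features, mask)
print(masked[1], sparsity(mask))
```

Because the mask acts on entire channels rather than individual activations, a dropped channel contributes nothing downstream, which is the sense in which the abstract speaks of enforcing feature sparsity.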

Keywords: Out-of-Distribution Generalization, Hierarchical Causal Dropout, Channel-level Sparsification, Causal Intervention, Matrix-based Mutual Information, StyleMix-driven VICReg, Shortcut Learning, Domain-invariant Features

19. ❌ To Write or to Automate Linguistic Prompts, That Is the Question

Authors: Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25169v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

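Scoring rationale: The paper's core is prompt engineering for LLMs, comparing hand-crafted expert prompts with automatically optimized DSPy/GEPA prompts. This is a direct application of LLM technology, so the “Large Language Models” keyword scores 10. Other keywords, such as MoE, SLMs, training methods, inference acceleration, alignment, RAG, and agents, are neither mentioned in the abstract nor relevant to it and score 0.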

!!! tip deepseek-chat TL;DR

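The study systematically compares hand-crafted expert prompts with automatically optimized prompts (DSPy/GEPA) on translation, terminology insertion, and language quality assessment. Results vary by task, and most expert-versus-optimized comparisons show no statistically significant difference.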

摘要翻译

大语言模型（LLM）的性能对提示设计高度敏感，然而在语言学任务中，自动提示优化能否取代专家提示工程仍未被探索。我们首次系统比较了手工设计的零样本专家提示、基础DSPy签名以及GEPA优化的DSPy签名在翻译、术语插入和语言质量评估（LQA）任务中的表现，评估了五种模型配置。结果具有任务依赖性。在术语插入任务中，优化提示与人工提示产生的质量在统计上大多无法区分。在翻译任务中，不同方法在不同模型上各有优势。在语言质量评估中，专家提示实现了更强的错误检测能力，而优化则提升了错误特征描述性能。在所有任务中，GEPA显著提升了基础DSPy签名的表现，且多数专家提示与优化提示的比较未显示出统计学上的显著差异。需要指出的是，这种比较存在不对称性：GEPA优化通过程序化搜索基于黄金标准数据分割进行，而专家提示原则上无需标注数据，其依赖的是领域专业知识与迭代优化过程。

摘要 (Abstract)

LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.
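The abstract reports that most expert-versus-optimized comparisons show no statistically significant difference but does not name the test used. As one possible way to make such a paired comparison, the sketch below runs an exact paired sign-flip permutation test. The choice of test, the function name `paired_sign_flip_pvalue`, and the integer ratings are all assumptions for illustration, not the authors' procedure or data.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided p-value for a paired sign-flip permutation test.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated (2**n of them) and the share whose
    absolute summed difference is at least the observed one is returned.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical integer quality ratings for the same four items under each prompt type.
expert_prompt = [3, 4, 5, 4]
optimized_prompt = [2, 3, 4, 3]
p_value = paired_sign_flip_pvalue(expert_prompt, optimized_prompt)
print(p_value)  # prints 0.125
```

With only four paired items, even a consistent one-point advantage gives p = 0.125, which shows how small evaluation sets can leave real differences statistically indistinguishable. Exhaustive enumeration is practical only for small samples; larger ones would need random sampling of sign assignments.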

Keywords: LLM, prompt design, automatic prompt optimization, expert prompt engineering, linguistic tasks, DSPy, GEPA, translation

20. ❌ Enabling ab initio geometry optimization of strongly correlated systems with transferable deep quantum Monte Carlo

Authors: P. Bernát Szabó, Zeno Schätzle, Frank Noé Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25381v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

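Scoring rationale: The paper focuses on quantum chemistry, proposing a method that combines transferable deep-learning variational Monte Carlo (VMC) with Gaussian process regression to explore molecular potential energy surfaces at high accuracy. Its core is the application of deep learning to scientific computing (specifically quantum chemistry), which falls under "AI for Science". However, it does not touch on large language models (LLMs), model architectures (such as MoE), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or any of the other listed general large-model keywords. All keywords other than "AI for Science" therefore score 0.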

!!! tip deepseek-chat TL;DR

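The study proposes a framework that combines transferable deep-learning variational Monte Carlo with Gaussian process regression, addressing the difficulty of exploring the molecular potential energy surfaces of strongly correlated systems accurately and efficiently. It enables structure relaxation, transition-state searches, and minimum-energy path calculations at zero-shot chemical accuracy.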

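Abstract translation

An accurate description of chemical processes requires exploring extended regions of the molecular potential energy surface, which remains challenging for strongly correlated systems. Transferable deep-learning variational Monte Carlo offers a promising route to this goal by efficiently solving the electronic Schrödinger equation jointly across molecular geometries at consistently high accuracy, but its stochastic nature makes direct exploration of molecular configuration space difficult. This paper presents a framework for highly accurate ab initio exploration of potential energy surfaces that combines transferable deep-learning variational Monte Carlo with low-cost estimation of energies, forces, and Hessians. By continuously sampling nuclear configurations during the variational Monte Carlo optimization of the electronic wave function, we obtain transferable descriptions that achieve zero-shot chemical accuracy within chemically relevant distributions of molecular geometries. In the subsequent characterization of molecular configuration space, the potential energy surface needs to be evaluated only sparsely: local approximations are built by estimating variational Monte Carlo energies and forces at sampled geometries and aggregating the resulting noisy data with Gaussian process regression. Our method enables accurate and efficient exploration of complex potential energy landscapes, including structure relaxation, transition-state searches, and minimum-energy pathways for both ground and excited states. This opens the door to studying bond breaking, bond formation, and large-scale structural rearrangements in systems with pronounced multi-reference character.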

Abstract

A faithful description of chemical processes requires exploring extended regions of the molecular potential energy surface (PES), which remains challenging for strongly correlated systems. Transferable deep-learning variational Monte Carlo (VMC) offers a promising route by efficiently solving the electronic Schrödinger equation jointly across molecular geometries at consistently high accuracy, yet its stochastic nature renders direct exploration of molecular configuration space nontrivial. Here, we present a framework for highly accurate ab initio exploration of PESs that combines transferable deep-learning VMC with a cost-effective estimation of energies, forces, and Hessians. By continuously sampling nuclear configurations during VMC optimization of electronic wave functions, we obtain transferable descriptions that achieve zero-shot chemical accuracy within chemically relevant distributions of molecular geometries. Throughout the subsequent characterization of molecular configuration space, the PES is evaluated only sparsely, with local approximations constructed by estimating VMC energies and forces at sampled geometries and aggregating the resulting noisy data using Gaussian process regression. Our method enables accurate and efficient exploration of complex PES landscapes, including structure relaxation, transition-state searches, and minimum-energy pathways, for both ground and excited states. This opens the door to studying bond breaking, formation, and large structural rearrangements in systems with pronounced multi-reference character.
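
The step of aggregating noisy energy estimates with Gaussian process regression can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a one-dimensional geometry coordinate, a zero-mean GP with a squared-exponential kernel, a toy harmonic energy curve, and hand-picked noise offsets. It is not the authors' implementation, and it ignores forces and Hessians.

```python
import math


def rbf(a, b, length=0.5, variance=1.0):
    """Squared-exponential kernel between two scalar geometry coordinates."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_mean(xs, ys, noise_var, x_new):
    """Posterior mean of a zero-mean GP fitted to noisy observations."""
    K = [[rbf(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return sum(rbf(x_new, a) * w for a, w in zip(xs, alpha))


# Toy data: harmonic curve E(r) = (r - 1)^2 plus fixed stand-in "noise".
geometries = [0.6, 0.8, 1.0, 1.2, 1.4]
offsets = [0.01, -0.01, 0.005, -0.005, 0.01]
energies = [(r - 1.0) ** 2 + o for r, o in zip(geometries, offsets)]
print(gp_mean(geometries, energies, noise_var=1e-4, x_new=1.1))
```

The `noise_var` term on the diagonal is what lets the fit smooth over stochastic estimates instead of interpolating them exactly.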

Keywords: deep-learning variational Monte Carlo, ab initio geometry optimization, strongly correlated systems, potential energy surface, Gaussian process regression, zero-shot chemical accuracy, molecular configuration space, transferable wave functions

21. ❌ Back to Basics: Revisiting ASR in the Age of Voice Agents

Authors: Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, Alex Smola Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25727v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

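Scoring rationale: The paper focuses on the evaluation and benchmarking of automatic speech recognition (ASR) systems, in particular their robustness in voice agents under real-world conditions. It does not involve large language models (LLMs), innovations in deep learning techniques, or applications of large models in other domains. The only relevant keyword is "Hallucination Mitigation": the paper notes that ASR models hallucinate plausible but unspoken content under partial or degraded inputs, which constitutes a safety risk, so it is given 5 points (some relevance). All other keywords are unrelated to the paper and score 0.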

!!! tip deepseek-chat TL;DR

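The paper studies the robustness of automatic speech recognition (ASR) systems in real-world voice agents. By introducing WildASR, a multilingual diagnostic benchmark, it finds that existing ASR systems degrade severely and unevenly under environmental degradation, demographic shift, and linguistic diversity, and that they carry a hallucination risk, underscoring the importance of targeted evaluation for improving the reliability of production systems.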

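Abstract translation

Automatic speech recognition (ASR) systems have reached near-human accuracy on curated test sets, yet in real-world voice agents they still fail frequently under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause how much degradation. This paper introduces WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech, which decomposes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, when given partial or degraded inputs, models often hallucinate plausible but unspoken content, which creates concrete safety risks for downstream agent behavior. Our results show that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Beyond the benchmark itself, we also provide three analytical tools that practitioners can use to guide deployment decisions.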

Abstract

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

Keywords: Automatic Speech Recognition, ASR, Voice Agents, Robustness Evaluation, Multilingual Benchmark, Hallucination, Safety Risks, Diagnostic Tools

22. ❌ Insights on back marking for the automated identification of animals

Authors: David Brunner, Marie Bordes, Elisabeth Mayrhuber, Stephan M. Winkler, Viktoria Dorfer, Maciej Oczak Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25535v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

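Scoring rationale: The paper studies the identification of individual pigs from back marks using a ResNet-50 neural network, an application of computer vision to animal monitoring. The keywords all relate directly to large language models (LLMs), innovations in deep learning techniques, training and optimization methods (such as MoE, Scaling Laws, PEFT, and RLHF), reasoning techniques (such as CoT and MCTS), agent systems, or specific AI-for-science fields (such as bioinformatics). The paper only uses a basic convolutional neural network (ResNet-50) for image classification and involves no large models, no innovations in deep learning techniques, and none of the specific methods named in the keywords. The only loosely relevant keyword is "AI for Science", because the study applies AI to animal science and agriculture. It is not a core match for bioinformatics or cheminformatics, so it is given 5 points (some relevance). All other keywords are entirely unrelated and score 0.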

!!! tip deepseek-chat TL;DR

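By training a ResNet-50 model to recognize back marks on pigs, the study analyzes the principles of mark design that remain effective under motion blur, varied view angles, and occlusion, providing guidelines for machine learning-based monitoring of individual animals.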

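Abstract translation

To date, there has been little research on how to design back marks that best support individual-level monitoring of uniform-looking species such as pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on designing marks that such algorithms can recognize effectively. Based on the analysis of a machine learning model trained to distinguish pigs by their back marks, this study offers insights into effective back mark design. Specifically, we trained a ResNet-50 neural network to classify ten pigs with unique back marks. Analysis of the model's predictions highlights the importance of certain design choices, even in a controlled setting. Most importantly, the set of back marks must be designed so that each mark remains unambiguous under motion blur, diverse view angles, and occlusions caused by animal behaviour. In addition, back mark design must take into account the data augmentation strategies commonly used during model training, such as colour, flip, and crop augmentations. These insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.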

Abstract

To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model’s predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.
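
One design point from the abstract, that marks must stay distinct under flip augmentation, can be checked mechanically. The sketch below is an illustration only: the 3×3 binary grids, the mark names, and the helper functions are hypothetical and are not taken from the paper.

```python
def hflip(mark):
    """Mirror a binary mark (a list of rows) left to right."""
    return [row[::-1] for row in mark]


def flip_collisions(marks):
    """Return name pairs (a, b), a < b, where flipping one mark yields the other."""
    names = sorted(marks)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if hflip(marks[a]) == marks[b]]


# Hypothetical mark set: "L" and "J" are mirror images, "T" is symmetric.
marks = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "J": [[0, 0, 1], [0, 0, 1], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
print(flip_collisions(marks))
```

A mark set that returns an empty list here is safe to use with horizontal-flip augmentation. A non-empty list means either the marks or the augmentation policy needs to change.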

Keywords: back mark design, animal monitoring, ResNet-50, machine learning, individual identification, pigs, data augmentation, computer vision

23. ❌ Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

Authors: Jonas Hein, Lilian Calvet, Matthias Seibold, Siyu Tang, Marc Pollefeys, Philipp Fürnstahl Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25228v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

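Scoring rationale: The paper focuses on computer vision and medical image processing, proposing a training-free method for surgical instrument detection and 6D pose estimation. Its core techniques are multi-view geometry, feature extraction, template matching, and contour registration, which belong to conventional computer vision and applied deep learning rather than to innovations in large language models (LLMs) or large-model techniques. The keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, Agents, and so on) are unrelated to the paper, so all but the last score 0. The last keyword, "AI for Science" OR "Bioinformatics" OR "Cheminformatics", scores 5 because the paper applies AI to medicine (surgery), which falls under "AI for Science" but is not a core match (the paper emphasizes computer vision methods rather than bioinformatics or cheminformatics).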

!!! tip deepseek-chat TL;DR

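The paper proposes a training-free pipeline based on multi-view geometry and contour registration for detecting unseen surgical instruments and estimating their 6D pose in surgical scenes, achieving millimeter-level accuracy on par with supervised methods while generalizing fully to unseen instruments.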

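Abstract translation

Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen instruments and require large amounts of annotated data. This work proposes a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments that requires only a textured CAD model as prior knowledge. Methods: The pipeline consists of two main stages. First, in the detection stage, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, in the pose estimation stage, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration that minimizes the reprojection error of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. It achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes and highlight the unique challenges of surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundation models, multi-view geometry, and contour-based refinement to achieve high-accuracy 6D pose estimation of surgical instruments without task-specific training. The approach enables robust instrument tracking and scene understanding in dynamic clinical environments.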

Abstract

Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
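
The detection stage triangulates matched detections into 3D candidates and filters them by geometric consistency. The abstract does not say which triangulation is used, so the sketch below swaps in the simple two-ray midpoint method as a stand-in. The camera centres, ray directions, and function names are hypothetical.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Closest-point ("midpoint") triangulation of two viewing rays.

    c1, c2 are camera centres; d1, d2 are ray directions through the two
    matched detections. Returns the 3D midpoint and the gap between the
    rays, which can serve as a simple geometric-consistency score.
    """
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    midpoint = [(u + v) / 2 for u, v in zip(p1, p2)]
    gap = dot(sub(p1, p2), sub(p1, p2)) ** 0.5
    return midpoint, gap


# Two cameras one unit apart, both looking at the same hypothetical point.
target = [0.5, 0.2, 2.0]
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
point, gap = triangulate_midpoint(cam1, sub(target, cam1), cam2, sub(target, cam2))
print(point, gap)
```

Candidates whose `gap` exceeds a chosen threshold would be discarded as inconsistent matches. The threshold itself is a tuning choice not specified in the abstract.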

Keywords: surgical instrument detection, 6D pose estimation, training-free pipeline, multi-view geometry, contour registration, computer-assisted interventions, MVPSP dataset, generalization to unseen instruments

24. ❌ A Distribution-to-Distribution Neural Probabilistic Forecasting Framework for Dynamical Systems

Authors: Tianlin Yang, Hailiang Du, Louis Aslett Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25370v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

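Scoring rationale: The paper proposes a distribution-to-distribution (D2D) neural probabilistic forecasting framework for dynamical systems, whose core idea is to use kernel mean embeddings and mixture density networks to operate on and evolve predictive distributions directly rather than trajectories. Of the 27 keywords, only "AI for Science" OR "Bioinformatics" OR "Cheminformatics" is weakly relevant (5 points): the framework is applied to a scientific problem (the chaotic Lorenz63 system), which falls under AI for Science in a broad sense, but the paper does not involve large models, innovations in deep learning techniques, or bioinformatics or cheminformatics. The other 26 keywords focus on large-model techniques (LLMs, MoE, alignment, reasoning, agents, and so on) or their specific applications and have no direct connection to this general neural probabilistic forecasting framework, so they score 0. The weighted total is only 5.0, far below the dynamic pass mark of 26.6, indicating that the paper is largely unrelated to the large-model and deep learning topics this review targets.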
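
The arithmetic behind the weighted total and the pass mark can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report configuration. Treating each keyword's score as weight × relevance is an assumption read off the table rows, and the function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords with relevance 0 and one keyword with relevance 5, all weight 1.0.
rows = [(1.0, 0.0)] * 26 + [(1.0, 5.0)]
score = total_score(rows)
mark = pass_mark(sum(weight for weight, _ in rows))
print(f"{score:.1f} / {mark:.1f}", "pass" if score >= mark else "fail")
```

With 27 keywords of weight 1.0 the pass mark is 26.6, so a paper needs meaningful relevance on several keywords to pass. A single partial match cannot reach it.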

!!! tip deepseek-chat TL;DR

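The study proposes a distribution-to-distribution (D2D) neural probabilistic forecasting framework for uncertainty quantification in dynamical systems. It evolves predictive distributions directly instead of relying on ensemble simulation, and is validated on the chaotic Lorenz63 system, where it remains competitive with, and in some cases outperforms, a simplified perfect model benchmark.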

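Abstract translation

Probabilistic forecasting provides a principled framework for uncertainty quantification in dynamical systems by representing predictions as probability distributions rather than deterministic trajectories. However, existing forecasting approaches, whether physics-based or neural-network-based, remain fundamentally trajectory-oriented: predictive distributions are usually obtained through ensembles or sampling rather than evolved directly as dynamical objects. This paper develops a distribution-to-distribution (D2D) neural probabilistic forecasting framework that operates directly on predictive distributions. The framework builds a distributional encoding and decoding structure around a replaceable neural forecasting module, using kernel mean embeddings to represent input distributions and mixture density networks to parameterise output predictive distributions. This design allows predictive uncertainty to be propagated recursively within a unified end-to-end neural architecture, with the model trained and evaluated directly in terms of probabilistic forecast skill. The framework is demonstrated on the chaotic Lorenz63 system. The results show that the D2D model captures nontrivial distributional evolution under nonlinear dynamics, produces skillful probabilistic forecasts without explicit ensemble simulation, and remains competitive with, and in some cases outperforms, a simplified perfect model benchmark. These findings point to a new paradigm for probabilistic forecasting in which predictive distributions are learned and evolved directly rather than reconstructed indirectly through ensemble-based uncertainty propagation.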

Abstract

Probabilistic forecasting provides a principled framework for uncertainty quantification in dynamical systems by representing predictions as probability distributions rather than deterministic trajectories. However, existing forecasting approaches, whether physics-based or neural-network-based, remain fundamentally trajectory-oriented: predictive distributions are usually accessed through ensembles or sampling, rather than evolved directly as dynamical objects. A distribution-to-distribution (D2D) neural probabilistic forecasting framework is developed to operate directly on predictive distributions. The framework introduces a distributional encoding and decoding structure around a replaceable neural forecasting module, using kernel mean embeddings to represent input distributions and mixture density networks to parameterise output predictive distributions. This design enables recursive propagation of predictive uncertainty within a unified end-to-end neural architecture, with model training and evaluation carried out directly in terms of probabilistic forecast skill. The framework is demonstrated on the Lorenz63 chaotic dynamical system. Results show that the D2D model captures nontrivial distributional evolution under nonlinear dynamics, produces skillful probabilistic forecasts without explicit ensemble simulation, and remains competitive with, and in some cases outperforms, a simplified perfect model benchmark. These findings point to a new paradigm for probabilistic forecasting, in which predictive distributions are learned and evolved directly rather than reconstructed indirectly through ensemble-based uncertainty propagation.
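
To make the input encoding concrete, the sketch below computes an empirical kernel mean embedding of a sample set at fixed landmark points, plus a squared maximum mean discrepancy (MMD) between two sample sets. It is a minimal illustration under stated assumptions: scalar states rather than the three-dimensional Lorenz63 state, a Gaussian kernel, and hypothetical landmark locations and function names. The paper's actual encoder may differ.

```python
import math


def rbf(x, z, bandwidth=1.0):
    """Gaussian kernel between two scalar states."""
    return math.exp(-((x - z) ** 2) / (2 * bandwidth ** 2))


def mean_embedding(samples, landmarks, bandwidth=1.0):
    """Empirical kernel mean embedding evaluated at fixed landmark points."""
    return [sum(rbf(x, z, bandwidth) for x in samples) / len(samples)
            for z in landmarks]


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    def avg(a, b):
        return sum(rbf(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))
    return avg(xs, xs) + avg(ys, ys) - 2 * avg(xs, ys)


# Two hypothetical forecast ensembles summarised as fixed-length vectors.
landmarks = [-2.0, -1.0, 0.0, 1.0, 2.0]
ensemble_a = [-0.2, 0.0, 0.1, 0.3]
ensemble_b = [1.4, 1.6, 1.7, 2.1]
print(mean_embedding(ensemble_a, landmarks))
print(mmd2(ensemble_a, ensemble_b))
```

The embedding turns a distribution of any sample size into a fixed-length vector, which is the kind of input a downstream neural forecasting module can consume.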

Keywords: probabilistic forecasting, dynamical systems, distribution-to-distribution, kernel mean embeddings, mixture density networks, uncertainty quantification, neural forecasting, Lorenz63

25. ❌ Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Authors: Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

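Scoring rationale: The paper's core contribution is optimizing the knowledge base of a RAG system, which is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points) because it directly proposes the WriteBack-RAG framework to improve RAG. The paper uses LLMs as the backbone of the RAG pipeline, giving some relevance to "Large Language Models OR LLMs OR Foundation Models" (8 points), though it does not go deeply into LLM techniques themselves. Other keywords such as MoE, SFT, and RLHF are not involved and score 0.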

!!! tip deepseek-chat TL;DR

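Addressing the problem that the knowledge base in RAG systems is static and its information scattered, the paper proposes the WriteBack-RAG framework, which trains the knowledge base through evidence distillation and write-back enrichment, yielding an average gain of 2.14% across multiple RAG methods, benchmarks, and LLM backbones.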

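Abstract translation

The knowledge base in a retrieval-augmented generation (RAG) system is usually built once and never updated, even though the facts a query needs are often scattered across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose the WriteBack-RAG framework. The framework uses labeled examples to identify cases where retrieval succeeds, isolates the relevant documents, and distills them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. In experiments across four RAG methods, six benchmarks, and two large language model (LLM) backbones, WriteBack-RAG improves every evaluated setting, with an average gain of +2.14%. Cross-method transfer experiments further show that the distilled knowledge units also benefit RAG pipelines other than the one that produced them, confirming that the improvement resides in the corpus itself.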

Abstract

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.
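
The write-back loop described in the abstract (find successful retrievals, isolate relevant documents, distill a compact unit, index it next to the original corpus) can be sketched with a toy pipeline. Every piece here is a stand-in chosen for illustration: a Jaccard-overlap retriever, "distillation" by keeping sentences that mention the answer, and hypothetical function names and example documents. The paper's own retrievers and distillation step are not reproduced.

```python
def tokens(text):
    """Lowercased word set with basic punctuation stripped."""
    for ch in ".,?":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(corpus, query, k=2):
    """Toy retriever: rank documents by Jaccard overlap with the query."""
    q = tokens(query)

    def score(doc):
        d = tokens(doc)
        return len(q & d) / len(q | d)

    return sorted(corpus, key=score, reverse=True)[:k]


def distill(docs, answer):
    """Stand-in for distillation: keep only sentences that mention the answer."""
    kept = [s.strip() for doc in docs for s in doc.split(".")
            if answer.lower() in s.lower()]
    return ". ".join(kept) + "." if kept else ""


def write_back(corpus, labeled_examples, k=2):
    """Return the original corpus plus units distilled from successful retrievals."""
    units = []
    for query, answer in labeled_examples:
        hits = retrieve(corpus, query, k)
        relevant = [d for d in hits if answer.lower() in d.lower()]
        unit = distill(relevant, answer)
        if unit and unit not in corpus and unit not in units:
            units.append(unit)
    return corpus + units


corpus = [
    "Marie Curie was born in Warsaw. The city has many museums and parks.",
    "Curie won the Nobel Prize in Physics in 1903. She later moved labs.",
    "Warsaw is the capital of Poland.",
]
examples = [("Where was Marie Curie born?", "Warsaw")]
enriched = write_back(corpus, examples)
print(retrieve(enriched, "Where was Marie Curie born?", k=1))
```

Because only the corpus changes, the same `retrieve` function is used before and after enrichment, which mirrors the abstract's point that the step is an offline preprocessing pass.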

Keywords: Retrieval-Augmented Generation, RAG, knowledge base, evidence distillation, write-back enrichment, LLM, corpus indexing, offline preprocessing

26. ❌ Vega: Learning to Drive with Natural Language Instructions

Authors: Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	8.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

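Scoring rationale: The paper proposes Vega, a vision-language-world-action model for autonomous driving whose core function is to process natural language instructions and generate trajectories. Relevance to the keywords: (1) "Large Language Models" (5 points): the model processes language instructions but does not explicitly use an LLM; (2) "Pre-training" and "Post-training/SFT" (5 points each): the work involves model training; (3) "Instruction Tuning" (8 points): highly relevant, because the model is designed specifically to follow driving instructions; (4) "Chain of Thought" (5 points): some relevance, as a reasoning process is involved; (5) "LLM Agents" (8 points): highly relevant, because the model acts as an autonomous driving agent; (6) "World Models" (8 points): highly relevant, because the model includes a world-modeling component for future prediction. Other keywords such as MoE, SLMs, RAG, and quantization are unrelated to the paper and score 0.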

!!! tip deepseek-chat TL;DR

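The paper studies how an autonomous driving system can drive in a personalized way by following natural language instructions. It proposes Vega, a model that integrates vision, language, and world modeling and, after training on a large-scale dataset, achieves superior planning performance and instruction-following ability.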

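Abstract translation

Vision-language-action models have reshaped autonomous driving by bringing the language modality into the decision-making process. However, most existing pipelines use language only for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes, each annotated with diverse driving instructions and the corresponding trajectories. We then propose Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning. We use the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). Joint attention enables interaction between the modalities, and separate projection layers for the different modalities add model capacity. Extensive experiments show that our method not only achieves superior planning performance but also exhibits strong instruction-following ability, paving the way for more intelligent and personalized driving systems.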

Abstract

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

Keywords: Vision-Language-Action Models, Autonomous Driving, Natural Language Instructions, World Modeling, Trajectory Generation, Instruction-Following, Personalized Driving, Diffusion Paradigm

27. ❌ PixelSmile: Toward Fine-Grained Facial Expression Editing

Authors: Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25728v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

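Scoring rationale: The paper focuses on fine-grained facial expression editing in computer vision, proposing the PixelSmile diffusion framework and the FFE dataset. It covers image generation, semantic disentanglement, and contrastive learning, but does not touch on large language models, innovations in core deep learning techniques, or scientific applications. Every keyword concerns large models, core deep learning techniques, or AI for Science, whereas this paper is purely computer vision and image processing research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses semantic overlap in fine-grained facial expression editing by proposing the PixelSmile diffusion framework and building the FFE dataset, achieving precise, controllable expression editing with identity preservation.

Abstract (Chinese translation)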

细粒度面部表情编辑长期以来受限于内在的语义重叠问题。为解决此问题，我们构建了带有连续情感标注的Flex面部表情数据集，并建立了FFE-Bench评估体系，用于衡量结构混淆度、编辑准确性、线性可控性以及表情编辑与身份保持之间的平衡。我们提出PixelSmile——一种通过完全对称联合训练实现表情语义解耦的扩散框架。该框架结合强度监督与对比学习，以生成更强烈且更具区分度的表情，并通过文本潜在空间插值实现精确稳定的线性表情控制。大量实验表明，PixelSmile实现了卓越的解耦效果与鲁棒的身份保持能力，证实了其在连续、可控、细粒度表情编辑方面的有效性，同时天然支持平滑的表情融合。

Abstract

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

Keywords: facial expression editing, diffusion framework, semantic disentanglement, contrastive learning, linear controllability, identity preservation, fine-grained editing, continuous expression control
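The "textual latent interpolation" mentioned in the PixelSmile abstract can be pictured as a linear blend between two text embeddings. The sketch below only illustrates that idea and is not the authors' code: the function name `interpolate_latents`, the 3-dimensional embeddings, and the use of a plain convex combination are all assumptions.

```python
def interpolate_latents(neutral, target, alpha):
    """Linearly blend two equal-length embedding vectors.

    alpha = 0 returns the neutral embedding, alpha = 1 the target one.
    """
    if len(neutral) != len(target):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * n + alpha * t for n, t in zip(neutral, target)]


# Hypothetical 3-dimensional text embeddings, for illustration only.
neutral = [0.0, 1.0, 2.0]
smile = [1.0, 1.0, 0.0]
half = interpolate_latents(neutral, smile, 0.5)
print(half)
```

Sweeping `alpha` from 0 to 1 would give the kind of continuous intensity control the abstract describes, under the assumption that the latent space behaves linearly along that direction.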

28. ❌ PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

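Scoring rationale: The paper focuses on video diffusion models, specifically the KV-cache growth problem in long-video generation. The only keyword on the list that is highly relevant is "KV Cache Compression OR Linear Attention OR FlashAttention" (rated 10), because the paper's core contribution is a three-partition KV-cache strategy that compresses historical context, which is directly a KV-cache compression technique. The other keywords are unrelated: the paper does not cover large language models, alignment, reasoning, agents, AI for Science, or similar topics.

!!! tip deepseek-chat TL;DR

The paper proposes PackForcing, a framework that uses a novel three-partition KV-cache strategy and a dynamic context selection mechanism to address linear KV-cache growth, temporal repetition, and error accumulation in long-video generation with autoregressive video diffusion models. It generates high-quality 2-minute videos on a single GPU and reaches 24x temporal extrapolation with training on only 5-second clips.

Abstract (Chinese translation)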

自回归视频扩散模型已取得显著进展，但在生成长视频时仍受限于难以处理的线性KV缓存增长、时间重复性以及误差累积问题。为应对这些挑战，我们提出了PackForcing——一个通过新颖的三分区KV缓存策略高效管理生成历史的统一框架。具体而言，我们将历史上下文划分为三种不同类型：(1) 锚定令牌(Sink tokens)，以完整分辨率保留早期锚定帧以维持全局语义；(2) 中间令牌(Mid tokens)，通过融合渐进式3D卷积与低分辨率VAE重编码的双分支网络，实现大规模时空压缩（令牌量减少32倍）；(3) 近期令牌(Recent tokens)，保持完整分辨率以确保局部时间连贯性。为在保证质量的同时严格限制内存占用，我们为中间令牌引入了动态top-$k$上下文选择机制，并结合连续时间旋转位置编码调整(Temporal RoPE Adjustment)，以可忽略的开销无缝重新对齐因令牌丢弃产生的位置间隙。凭借这种层次化上下文压缩机制，PackForcing可在单张H200 GPU上生成连贯的2分钟、832x480分辨率、16帧/秒的视频。其KV缓存严格限制在4GB，并实现惊人的24倍时间外推能力（从5秒到120秒），无论是零样本运行还是仅用5秒片段训练均能有效工作。在VBench上的大量实验结果表明，该方法在时间一致性（26.07）和动态程度（56.25）指标上达到最先进水平，证明短视频监督足以实现高质量的长视频合成。https://github.com/ShandaAI/PackForcing

Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing

Keywords: video diffusion models, KV-cache compression, long-video generation, autoregressive models, temporal consistency, context window management, inference efficiency, hierarchical context compression
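The three-partition cache described in the PackForcing abstract (sink, mid, recent, with top-k selection over the mid part) can be sketched on plain frame indices. This is a toy illustration under stated assumptions, not the released implementation: the helper names, the relevance scores, and the idea of re-aligning positions by simple contiguous re-indexing are all hypothetical, and the 32x mid-token compression network is not modeled.

```python
def partition_history(frame_ids, n_sink, n_recent):
    """Split a chronological list of frame ids into sink, mid, and recent parts."""
    if n_sink + n_recent >= len(frame_ids):
        # History is still short: nothing falls into the mid partition yet.
        return frame_ids[:n_sink], [], frame_ids[n_sink:]
    cut = len(frame_ids) - n_recent
    return frame_ids[:n_sink], frame_ids[n_sink:cut], frame_ids[cut:]


def select_top_k(mid, scores, k):
    """Keep the k highest-scoring mid entries, preserving chronological order."""
    ranked = sorted(range(len(mid)), key=lambda i: scores[i], reverse=True)[:k]
    return [mid[i] for i in sorted(ranked)]


def realign_positions(kept_ids):
    """Assign contiguous positions so dropped entries leave no gaps (assumed scheme)."""
    return {frame_id: pos for pos, frame_id in enumerate(kept_ids)}


frames = list(range(10))                   # ten generated frames, oldest first
sink, mid, recent = partition_history(frames, n_sink=2, n_recent=3)
scores = [0.1, 0.9, 0.3, 0.8, 0.2]         # hypothetical relevance of each mid frame
kept = sink + select_top_k(mid, scores, k=2) + recent
positions = realign_positions(kept)
print(kept, positions)
```

Whatever the history length, the kept set never exceeds `n_sink + k + n_recent` entries, which is the property that keeps the cache bounded.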

29. ❌ Natural-Language Agent Harnesses

Authors: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

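Scoring rationale: The paper focuses on harness engineering for agents, proposing Natural-Language Agent Harnesses (NLAHs) and the Intelligent Harness Runtime (IHR); its core concerns are LLM agents, tool use, and agent coordination. It is highly relevant to LLM agents (10) because it studies agent harness design; relevant to tool use (8) because it involves agent control logic and API use; and somewhat related to multi-agent systems (5) because it involves agent coordination and modular design. LLMs (8) are relevant as the underlying technology, although the paper does not go into LLM-specific technical details. Most other keywords (such as MoE, quantization, and inference acceleration) are unrelated to the paper and are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies how to externalize an agent's high-level control logic as a portable natural-language harness. It proposes Natural-Language Agent Harnesses (NLAHs) and the Intelligent Harness Runtime (IHR), and experimentally validates their operational viability and modularity benefits.

Abstract (Chinese translation)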

智能体性能日益依赖于线束工程，然而线束设计通常深嵌于控制器代码和运行时特定规范中，使其难以作为科学对象进行迁移、比较和研究。我们探讨是否可以将智能体线束的高层控制逻辑外部化为一种可移植的可执行构件。我们提出自然语言智能体线束（Natural-Language Agent Harnesses，NLAHs），该框架以可编辑的自然语言表达线束行为；同时引入智能线束运行时（Intelligent Harness Runtime，IHR），这是一个通过显式契约、持久化构件和轻量适配器来执行这些线束的共享运行时环境。在编程与计算机使用基准测试中，我们对操作可行性、模块消融以及代码到文本的线束迁移进行了对照评估。

Abstract

Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

Keywords: agent harness, natural-language harness, intelligent harness runtime, agent control logic, portable executable artifact, harness engineering, agent performance, module ablation

30. ❌ R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

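Scoring rationale: The paper proposes RC2, a reinforcement learning framework that improves multimodal reasoning through cross-modal cycle consistency. Its core focus is the reasoning process (Chain of Thought / System 2 Thinking) and self-improvement (Self-Correction), and it has no direct connection to most of the other keywords (LLM techniques, training methods, optimization techniques, and so on).

!!! tip deepseek-chat TL;DR

To address inconsistency between visual and textual representations in multimodal models, the paper proposes RC2, a reinforcement-learning-based cycle-consistency framework that uses a cross-modal consistency constraint to align internal representations autonomously, improving reasoning accuracy by up to 7.6 points.

Abstract (Chinese translation)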

稳健的感知与推理需要跨感官模态的一致性。然而当前的多模态模型常违背这一原则，对同一概念的视觉与文本表征产生矛盾预测。不同于采用可能放大系统性偏差的标准投票机制掩盖这些缺陷，我们证明跨模态不一致性为学习提供了丰富而自然的信号。我们提出RC2——一种通过强制跨模态循环一致性来解决内部冲突的强化学习框架。通过要求模型执行反向推理、切换模态并可靠地通过前向推理重构答案，我们获得了密集且无需标注的奖励。这种循环约束促使模型自主对齐其内部表征。针对该结构的优化减轻了特定模态的误差，并将推理准确率最高提升7.6个百分点。我们的研究结果表明，高级推理能力的涌现不仅源于数据规模的扩展，更得益于对世界构建结构一致的理解。

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.

Keywords: multimodal reasoning, reinforcement learning, cycle consistency, cross-modal inconsistency, internal representation alignment, reasoning accuracy, label-free reward, backward inference
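The backward, switch, forward cycle that the RC2 abstract uses as a label-free reward can be sketched as a single function. This is a minimal illustration, not the paper's method: the binary 0/1 reward, the `(modality, content)` view format, and all four toy callables are assumptions made for the example.

```python
def cycle_reward(answer, backward, switch_modality, forward):
    """Return 1.0 if the answer survives a backward -> switch -> forward cycle."""
    query = backward(answer)             # infer a query that would yield the answer
    other_view = switch_modality(query)  # re-express it in the other modality
    reconstructed = forward(other_view)  # answer again from the switched view
    return 1.0 if reconstructed == answer else 0.0


# Toy stand-ins (hypothetical): a "view" is a (modality, content) pair.
def backward(answer):
    return ("text", "2+2") if answer == "4" else ("text", "unknown")


def switch_modality(view):
    modality, content = view
    return ("image" if modality == "text" else "text", content)


def consistent_forward(view):
    # Gives the same answer regardless of modality.
    return {"2+2": "4"}.get(view[1], "?")


def inconsistent_forward(view):
    # Answers correctly from text but wrongly from images.
    return "4" if view[0] == "text" else "5"


r_ok = cycle_reward("4", backward, switch_modality, consistent_forward)
r_bad = cycle_reward("4", backward, switch_modality, inconsistent_forward)
print(r_ok, r_bad)
```

No ground-truth label enters the computation: the reward depends only on whether the model agrees with itself across modalities.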

31. ❌ Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Authors: Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, Akash Srivastava Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25719v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

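Scoring rationale: The paper studies general-purpose coding agents applied to hardware optimization; its core is a factory-style pipeline that constructs and coordinates multiple autonomous optimization agents. This is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" and "Multi-agent Systems OR Agent Coordination", because the paper explicitly involves autonomous agents, agent coordination, and agentic workflows. The other keywords mainly concern the core techniques of large models, training methods, inference optimization, alignment, compression, and so on. The paper does not address these: it only uses Claude Code as a tool and does not discuss its internals or related techniques, so they are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies general-purpose coding agents for hardware design optimization. A two-stage agent factory pipeline coordinates multiple autonomous agents and achieves a mean 8.27x speedup, showing that agent scaling is an effective approach to high-level synthesis optimization.

Abstract (Chinese translation)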

本文通过实证研究探讨了未经硬件特定训练的通用编码代理能在多大程度上基于高层算法描述优化硬件设计。我们提出了一种代理工厂——一个两阶段流水线，用于构建并协调多个自主优化代理。
在第一阶段，流水线将设计分解为子内核，通过编译制导指令（pragma）和代码级转换独立优化每个子内核，并构建整数线性规划（ILP）模型，在面积约束下整合具有全局潜力的配置方案。
在第二阶段，流水线针对ILP筛选出的最优解启动N个专家代理，每个代理探索跨函数优化策略，例如编译制导指令重组、循环融合和内存重构等子内核分解未能涵盖的优化维度。
我们在HLS-Eval和Rodinia-HLS的12个内核上使用Claude Code（Opus 4.5/4.6）与AMD Vitis HLS工具链进行评估。将代理数量从1个增至10个时，平均可获得基线设计8.27倍的加速效果，且在复杂基准测试中提升更为显著：streamcluster超过20倍，kmeans达到约10倍。
所有基准测试中，代理在未经领域特定训练的情况下均能复现已知的硬件优化模式，且最优设计往往并非来自ILP排名最高的候选方案，这表明全局优化能发掘子内核搜索遗漏的改进空间。这些结果证实了代理规模扩展可作为高层次综合（HLS）优化中实用且有效的技术路径。

Abstract

We present an empirical study of how far general-purpose coding agents – without hardware-specific training – can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27x speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20x and kmeans reaches approximately 10x. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.

Keywords: general-purpose coding agents, hardware optimization, agent factory, autonomous optimization agents, multi-agent systems, high-level synthesis, Integer Linear Program, Claude Code
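The Stage 1 assembly step in the abstract above picks one optimized variant per sub-kernel under an area constraint. The sketch below shows the shape of that selection problem on made-up numbers. It is an assumption-laden illustration: it swaps the ILP solver for brute-force enumeration with `itertools.product`, assumes the objective is total latency, and uses hypothetical sub-kernel names, latencies, and areas.

```python
from itertools import product


def best_assembly(candidates, area_budget):
    """Choose one variant per sub-kernel, minimizing total latency within the area budget.

    Brute-force stand-in for an ILP: fine for tiny examples, exponential in general.
    Returns (total_latency, {sub_kernel: variant_name}) or None if nothing fits.
    """
    best = None
    for choice in product(*candidates.values()):
        area = sum(c["area"] for c in choice)
        latency = sum(c["latency"] for c in choice)
        if area <= area_budget and (best is None or latency < best[0]):
            names = dict(zip(candidates.keys(), (c["name"] for c in choice)))
            best = (latency, names)
    return best


# Hypothetical per-sub-kernel variants (latency in cycles, area in arbitrary units).
candidates = {
    "load": [
        {"name": "baseline", "latency": 100, "area": 10},
        {"name": "unrolled", "latency": 40, "area": 30},
    ],
    "compute": [
        {"name": "baseline", "latency": 200, "area": 20},
        {"name": "pipelined", "latency": 80, "area": 50},
    ],
}
result = best_assembly(candidates, area_budget=70)
print(result)
```

In this toy instance the fastest variant of every sub-kernel does not fit the budget together, so the selection has to trade one off against the other, which is the reason a global formulation is needed at all.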

32. ❌ Neural Network Conversion of Machine Learning Pipelines

Authors: Man-Ling Sung, Jan Silovsky, Man-Hung Siu, Herbert Gish, Chinnu Pittapally Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25699v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

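Scoring rationale: The paper studies converting a traditional machine learning pipeline (random forest) into a neural network, which falls under knowledge distillation and transfer learning in deep learning. However, all keywords focus on large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), and the paper does not involve LLMs, large models, or any language-model technique, nor any specific AI for Science application. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies how to convert a random forest classifier into a student neural network through knowledge distillation, and shows on 100 OpenML tasks that the student can mimic the teacher's performance in the majority of cases.

Abstract (Chinese translation)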

迁移学习与知识蒸馏近来在深度学习领域备受关注。其中一种迁移方法——师生学习——已被证明能成功创建“小型”学生神经网络，以模仿规模更大、结构更复杂的“教师”网络的性能。本文研究了该方法的扩展应用：将基于非神经网络的机器学习流程作为教师模型，向神经网络学生模型进行知识迁移，从而实现对流程中各组件的联合优化，并为多种机器学习任务提供统一的推理引擎。具体而言，我们探索通过迁移学习将随机森林分类器替换为学生神经网络。我们在100个OpenML任务上测试了多种神经网络拓扑结构，这些任务中随机森林原本是最优解决方案之一。实验结果表明，在大多数任务中，若能选择恰当的神经网络超参数，学生神经网络确实能够有效模仿教师模型。我们还研究了利用随机森林辅助选择神经网络超参数的方法。

Abstract

Transfer learning and knowledge distillation have recently gained a lot of attention in the deep learning community. One transfer approach, student-teacher learning, has been shown to successfully create "small" student neural networks that mimic the performance of a much bigger and more complex "teacher" network. In this paper, we investigate an extension to this approach and transfer from a non-neural-based machine learning pipeline as teacher to a neural network (NN) student, which would allow for joint optimization of the various pipeline components and a single unified inference engine for multiple ML tasks. In particular, we explore replacing the random forest classifier by transfer learning to a student NN. We experimented with various NN topologies on 100 OpenML tasks in which random forest has been one of the best solutions. Our results show that for the majority of the tasks, the student NN can indeed mimic the teacher if one can select the right NN hyper-parameters. We also investigated the use of random forest for selecting the right NN hyper-parameters.

Keywords: knowledge distillation, transfer learning, neural network, random forest, student-teacher learning, machine learning pipeline, hyper-parameter selection, OpenML tasks
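Student-teacher transfer of the kind this abstract describes is commonly trained on the teacher's soft class probabilities rather than hard labels. The sketch below computes such a soft-label loss in plain Python. It is an assumption, not something the abstract states: the paper does not say which loss it uses, and the teacher probabilities here (for example, the fraction of trees voting for each class) are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy between teacher class probabilities and the student's softmax."""
    student_probs = softmax(student_logits)
    return -sum(
        p * math.log(q) for p, q in zip(teacher_probs, student_probs) if p > 0
    )


# Hypothetical teacher output: 80% of the trees vote for class 0.
teacher = [0.8, 0.2]
matched = distillation_loss(teacher, [math.log(0.8), math.log(0.2)])
mismatched = distillation_loss(teacher, [0.0, 0.0])
print(matched, mismatched)
```

The loss is smallest when the student's predicted distribution equals the teacher's, so minimizing it pushes the network toward the forest's behavior, including its uncertainty.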

33. ❌ Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

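Scoring rationale: The paper focuses on a specific memory-mechanism problem in video world models, proposing the Hybrid Memory paradigm and the HyDRA architecture to keep dynamic subjects continuous while they are out of view. Although it involves world models, it is limited to a specific video application and does not touch on large language models (LLMs), innovations in core deep learning techniques, or AI for Science. The only highly relevant keyword is "World Models AND General World Models", because the paper explicitly studies video world models, although not general world models. All other keywords are unrelated: the paper does not cover language models, training techniques, reasoning methods, agent systems, model optimization, or similar topics.

!!! tip deepseek-chat TL;DR

To address dynamic subjects freezing, distorting, or vanishing while out of view in video world models, the paper proposes the Hybrid Memory paradigm and the HyDRA memory architecture. It builds the HM-World dataset and shows experimentally that the method significantly improves dynamic subject consistency and overall generation quality.

Abstract (Chinese translation)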

视频世界模型在模拟物理世界方面展现出巨大潜力，但现有记忆机制主要将环境视为静态画布。当动态目标暂时移出视野并随后重新出现时，当前方法往往难以应对，导致目标出现冻结、扭曲或消失等问题。为此，我们提出混合记忆（Hybrid Memory）这一新范式，要求模型同时充当静态背景的精确记录者与动态目标的敏锐追踪者，确保目标在离开视野期间的运动连续性。为推进该方向研究，我们构建了首个专注于混合记忆的大规模视频数据集HM-World，包含5.9万条高保真视频片段，其相机轨迹与目标运动轨迹完全解耦，涵盖17类多样场景、49种不同目标，并通过精心设计的进出视野事件以严格评估混合连贯性。此外，我们提出专用记忆架构HyDRA，该架构将记忆压缩为令牌，并采用时空相关性驱动的检索机制。通过选择性关注相关运动线索，HyDRA能有效保持隐藏目标的身份特征与运动状态。在HM-World上的大量实验表明，我们的方法在动态目标一致性与整体生成质量方面均显著优于现有先进方法。

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

Keywords: Video World Models, Hybrid Memory, Dynamic Subjects, Memory Architecture, Spatiotemporal Retrieval, HM-World Dataset, Motion Continuity, HyDRA

34. ❌ The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Authors: Yannick Roy Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25697v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

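Scoring rationale: The paper proposes the Kitchen Loop framework, whose core is using an LLM agent as a synthetic user for software testing and verification, enabling autonomous, self-evolving software development. It is highly relevant to "LLM Agents" (10) because the LLM agent is the system's core component; relevant to "Large Language Models" (8) because the system depends on LLMs; relevant to "Tool Use" (8) because the LLM agent carries out testing tasks; and highly relevant to "Self-Correction" (10) because the system exhibits multi-iteration self-correction chains and autonomous healing. Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI for Science, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Kitchen Loop framework, which uses an LLM agent as a synthetic user for high-cadence testing and verification to enable autonomous, self-evolving software development. Over 285+ iterations it produced 1,094+ merged pull requests with zero regressions detected.

Abstract (Chinese translation)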

代码生产如今已成为一种商品；当前的瓶颈在于明确构建目标并验证其有效性。本文提出Kitchen Loop框架——一种基于统一信任模型的自主自演进软件架构，该模型包含四个核心组件：(1) 规格说明层：明确枚举产品声明支持的功能范围；(2) “千倍用户模拟”：通过LLM智能体以千倍于人类操作频度模拟超级用户对功能层进行验证；(3) 不可篡改测试：采用开发者无法伪造的基准真实验证机制；(4) 漂移控制：配备自动暂停阀门的持续质量监测体系。我们在两个生产系统中进行了285+次迭代验证，累计生成1,094+个合并请求，回归测试预言机（方法论见第6.1节）未检测到任何回归缺陷。我们观察到大规模运行中涌现的特性：多迭代自我修正链、自主基础设施修复能力以及单调改进的质量阀门。这些基础组件并非创新，我们的贡献在于将其整合为经过生产验证的系统，并通过严格的运维规范确保长期自主演进的安全性。

Abstract

Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) ‘As a User x 1000’, where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.

Keywords: autonomous software, self-evolving codebase, LLM agent, synthetic power user, trust model, regression oracle, quality gates, multi-iteration self-correction

35. ❌ A Unified Memory Perspective for Probabilistic Trustworthy AI

Authors: Xueji Zhao, Likai Pei, Jianbo Liu, Kai Ni, Ningyuan Cao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25692v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

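Scoring rationale: The paper focuses on hardware architecture and memory-system optimization for trustworthy AI, specifically the memory-access efficiency of probabilistic computation. Although it concerns AI systems, none of the keywords, which target large models, deep learning techniques, AI applications, or AI for science, apply. The paper does not mention language models, training methods, inference techniques, agent systems, or domain-specific applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper studies the memory-access efficiency of probabilistic computation in trustworthy AI systems, proposes a unified framework that treats deterministic access as a special case of stochastic sampling, and on that basis defines memory-level evaluation criteria to analyze the limitations of conventional architectures and the potential of emerging probabilistic compute-in-memory approaches.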

Abstract

Trustworthy artificial intelligence increasingly relies on probabilistic computation to achieve robustness, interpretability, security and privacy. In practical systems, such workloads interleave deterministic data access with repeated stochastic sampling across models, data paths and system functions, shifting performance bottlenecks from arithmetic units to memory systems that must deliver both data and randomness. Here we present a unified data-access perspective in which deterministic access is treated as a limiting case of stochastic sampling, enabling both modes to be analyzed within a common framework. This view reveals that increasing stochastic demand reduces effective data-access efficiency and can drive systems into entropy-limited operation. Based on this insight, we define memory-level evaluation criteria, including unified operation, distribution programmability, efficiency, robustness to hardware non-idealities and parallel compatibility. Using these criteria, we analyze limitations of conventional architectures and examine emerging probabilistic compute-in-memory approaches that integrate sampling with memory access, outlining pathways toward scalable hardware for trustworthy AI.

Keywords: Trustworthy AI, probabilistic computation, memory systems, stochastic sampling, hardware architecture, compute-in-memory, data-access efficiency, entropy-limited operation

36. ❌ Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Authors: Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

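Scoring rationale: The paper studies cross-view geo-localization (CVGL), localizing on a city-scale satellite map through autoregressive zooming; it belongs to computer vision and geographic information systems. It does not involve large language models, deep learning technical principles, AI for Science, or any technique named in the keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Just Zoom In, an autoregressive zooming method for cross-view geo-localization that uses sequential coarse-to-fine spatial reasoning to reach state-of-the-art performance on a realistic benchmark, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6%.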

Abstract

Cross-view geo-localization (CVGL) estimates a camera’s location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
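To make the reported metric concrete, the following is a minimal sketch of how "Recall@1 within d metres" could be computed, assuming it means the fraction of queries whose single top-ranked predicted location lies within d metres of the true camera location. The abstract does not give the exact definition (for example, whether distance is measured to a cell centre), so `haversine_m`, `recall_at_1_within`, and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def recall_at_1_within(predictions, ground_truth, radius_m):
    """Fraction of queries whose top-1 prediction is within radius_m of the truth.

    ASSUMPTION: one (lat, lon) prediction per query; not the paper's stated protocol.
    """
    hits = sum(
        haversine_m(*pred, *true) <= radius_m
        for pred, true in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


# Hypothetical data: three queries taken at the same true location.
truth = [(40.0, -83.0)] * 3
preds = [
    (40.0, -83.0),     # exact match
    (40.0005, -83.0),  # roughly 56 m north
    (40.01, -83.0),    # roughly 1.1 km north
]
print(recall_at_1_within(preds, truth, 50.0))
print(recall_at_1_within(preds, truth, 100.0))
```

Widening the radius from 50 m to 100 m can only keep or raise the recall, which is why the two thresholds are reported separately.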

Keywords: cross-view geo-localization, autoregressive zooming, satellite imagery, spatial reasoning, image retrieval, coarse-to-fine, benchmark, state-of-the-art

37. ❌ Measuring What Matters – or What’s Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Authors: Cole Walsh, Rodica Ivan Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25674v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

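Scoring rationale: The paper's core topic is the robustness of LLMs in automated scoring systems, in particular their resistance to construct-irrelevant factors, so it is highly relevant to 'Large Language Models' (10 points). It explicitly discusses 'hallucinations' and robustness in LLM-based scoring systems, making it highly relevant to 'Hallucination Mitigation' (10 points). The remaining keywords concern specific technical principles, training methods, inference techniques, and application domains that the paper does not address, so they score 0.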
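As a rough illustration of why this entry fails despite two 10.0/10 relevance values, the sketch below recomputes the total under the assumption that each row's score is weight × relevance. That assumption matches the rows above (weight 0.0, score 0.0) but is not stated in the report. The pass mark uses the formula from the report's configuration header (5.0 + 0.8 × total weight, with a total weight of 27.0); the header's per-keyword maximum of 15 is not modeled. `KeywordRow`, `pass_threshold`, and `total_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class KeywordRow:
    keyword: str
    weight: float
    relevance: float  # on the 0-10 scale shown in the score table


def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report's configuration header: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows) -> float:
    """ASSUMPTION: each row contributes weight * relevance; the report does not say so."""
    return sum(row.weight * row.relevance for row in rows)


# Only the two non-zero-relevance rows are listed; the rest contribute 0 either way.
rows = [
    KeywordRow("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    KeywordRow("Hallucination Mitigation OR Factuality OR Truthfulness", 0.0, 10.0),
]

threshold = pass_threshold(27.0)
score = total_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under this reading, a row's relevance has no effect on the total while its listed weight is 0.0, which would explain the 0.0 / 26.6 result.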

!!! tip deepseek-chat TL;DR

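This study evaluates the robustness of an LLM-based automated scoring system to construct-irrelevant factors (such as padding with meaningless text and spelling errors) and finds that the system is robust in most cases, but that duplicated text lowers predicted scores and off-topic responses are heavily penalized.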

Abstract

Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable to or superior than trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on “hallucinations” and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.

Keywords: large language models, automated scoring systems, robustness, construct-irrelevant factors, hallucinations, educational testing, situational judgment test, LLM-based scoring

38. ❌ A Mentalistic Interface for Probing Folk-Psychological Attribution to Non-Humanoid Robots

Authors: Giulio Pisaneschi, Pierpaolo Serio, Estelle Gerbier, Andrea Dan Ryals, Lorenzo Pollini, Mario G. C. A. Cimino Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

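Scoring rationale: The abstract explicitly mentions 'large language model-based explanatory layers', so the paper is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It studies intention attribution in robot psychology, an application of large models to a specific domain (human-robot interaction and psychology), which matches the research-background description of 'research applications of large models in different domains'. The other keywords concern specific technical details (such as MoE, quantization, and inference acceleration) or specific application domains (such as bioinformatics) that the paper does not mention, so they all score 0.

!!! tip deepseek-chat TL;DR

The study develops an experimental platform that combines a simulated robot, task environments, and LLM-based explanatory layers to investigate how language and framing shape people's attribution of intentions to non-humanoid robots.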

Abstract

This paper presents an experimental platform for studying intentional-state attribution toward a non-humanoid robot. The system combines a simulated robot, realistic task environments, and large language model-based explanatory layers that can express the same behavior in mentalistic, teleological, or mechanistic terms. By holding behavior constant while varying the explanatory frame, the platform provides a controlled way to investigate how language and framing shape the adoption of the intentional stance in robotics.

Keywords: intentional-state attribution, non-humanoid robot, large language model, explanatory layers, mentalistic terms, teleological terms, mechanistic terms, intentional stance

39. ❌ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Authors: Mingmeng Geng, Yuhang Dong, Thierry Poibeau Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25638v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

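Scoring rationale: The paper's core topic is the impact of LLMs on academic writing, which directly matches the 'Large Language Models' keyword (10 points). It analyzes arXiv papers, an application in the scientific domain, so it is somewhat related to 'AI for Science' (5 points). The other keywords concern specific technical principles, training methods, inference optimization, and application scenarios that the paper does not address, so they all score 0.

!!! tip deepseek-chat TL;DR

By analyzing arXiv papers, the paper studies the impact of large language models on academic writing style, finding that LLMs shift the frequency of specific words and showing that current classifiers struggle to identify which specific model generated a text.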

Abstract

Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of “beyond” and “via” in titles and the decreased frequency of “the” and “of” in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.

Keywords: large language models, LLMs, academic papers, arXiv, word usage, text classification, writing style, impact analysis

40. ❌ Visual or Textual: Effects of Explanation Format and Personal Characteristics on the Perception of Explanations in an Educational Recommender System

Authors: Qurat Ul Ain, Mohamed Amine Chatti, Nasim Yazdian Varjani, Farah Kamal, Astrid Rosenthal-von der Pütten Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25624v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

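Scoring rationale: The paper studies how visual versus textual explanation formats affect user perception in an educational recommender system; it belongs to human-computer interaction and recommender systems. It does not involve large models, deep learning technical principles, or scientific applications, and is only somewhat related to 'Mechanistic Interpretability OR Explainable AI' (it touches on explainable AI), which is not its core content. That keyword therefore receives 5 points, and all others receive 0.

!!! tip deepseek-chat TL;DR

Through a user study, the paper compares how visual and textual explanation formats in an educational recommender system affect perceived control, transparency, trust, and satisfaction, finds that well-designed visual explanations are more effective for most users, and proposes design guidelines.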

Abstract

Explanations are central to improving transparency, trust, and user satisfaction in recommender systems (RS), yet it remains unclear how different explanation formats (visual vs. textual) are suited to users with different personal characteristics (PCs). To this end, we report a within-subject user study (n=54) comparing visual and textual explanations and examine how explanation format and PCs jointly influence perceived control, transparency, trust, and satisfaction in an educational recommender system (ERS). Using robust mixed-effects models, we analyze the moderating effects of a wide range of PCs, including Big Five traits, need for cognition, decision making style, visualization familiarity, and technical expertise. Our results show that a well-designed visual, simple, interactive, selective, easy to understand visualization that clearly and intuitively communicates how user preferences are linked to recommendations, fosters perceived control, transparency, appropriate trust, and satisfaction in the ERS for most users, independent of their PCs. Moreover, we derive a set of guidelines to support the effective design of explanations in ERSs.

Keywords: explanation formats, visual explanations, textual explanations, educational recommender systems, user perception, personal characteristics, transparency, trust

41. ❌ Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Authors: Ünsal Öztürk, Hatef Otroshi Shahreza, Sébastien Marcel Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

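Scoring rationale: The paper studies the fairness of multimodal large language models (MLLMs) on face verification, centrally an evaluation of large models (LLMs) in a specific application scenario, so it is highly relevant to 'Large Language Models' (10 points). The evaluated models have 2B to 8B parameters, which is relatively small, so the paper is somewhat related to 'Small Language Models' (5 points). Other keywords such as MoE, Scaling Laws, training methods, inference optimization, and AI for Science are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper evaluates gender and ethnicity bias in nine open-source multimodal large language models on face verification, finding that the face-specialised model FaceLLM-8B outperforms general-purpose MLLMs and that the most accurate models are not necessarily the fairest.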

Abstract

Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
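For readers unfamiliar with the accuracy metric named above, here is a minimal sketch of an Equal Error Rate computation, assuming each comparison yields a similarity score and a pair is accepted when its score is at or above a threshold. The EER is taken at the candidate threshold where the false match rate (FMR) and false non-match rate (FNMR) are closest. The function names and scores are hypothetical, and the paper's four FMR-based fairness metrics are not reproduced here because the abstract does not define them.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FMR, FNMR) when pairs scoring >= threshold are accepted."""
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning every observed score as a threshold.

    Returns (eer, threshold) at the point where |FMR - FNMR| is smallest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        fmr, fnmr = error_rates(genuine, impostor, t)
        gap = abs(fmr - fnmr)
        if best is None or gap < best[0]:
            best = (gap, (fmr + fnmr) / 2, t)
    return best[1], best[2]


# Hypothetical similarity scores for same-person and different-person pairs.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]

eer, thr = equal_error_rate(genuine_scores, impostor_scores)
fmr, fnmr = error_rates(genuine_scores, impostor_scores, thr)
print(f"EER={eer:.2f} at threshold {thr}, TMR={1 - fnmr:.2f}")
```

Running the same function on each demographic group's scores separately would give per-group figures of the kind the abstract describes; a model with uniformly high EER across groups would look balanced while still being inaccurate.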

Keywords: Multimodal Large Language Models, face verification, demographic fairness, gender bias, ethnicity bias, benchmarking, MLLMs, FaceLLM

42. ❌ DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

Authors: Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

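Scoring rationale: The paper focuses on medical image analysis: it develops DeepFAN, a transformer-based model for lung nodule classification, and validates its benefit to radiologists in a clinical trial. It is unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, agent systems, and so on), because those keywords refer specifically to large language model techniques in natural language processing. Only two keywords apply: 1. 'Mechanistic Interpretability OR Explainable AI': the paper performs an explainability analysis that assesses the contributions of global and local features, so it receives 5 points (moderate relevance). 2. 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to biomedicine (specifically radiology and medical imaging), a typical instance of 'AI for Science', so it receives 10 points (highly relevant, core content).

!!! tip deepseek-chat TL;DR

The study develops DeepFAN, a transformer-based model for classifying lung nodules in CT scans as benign or malignant, and shows in a multi-center clinical trial that it significantly improves junior radiologists' diagnostic performance (AUC up by 10.9%) and diagnostic consistency.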

Abstract

The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers’ average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
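The inter-reader agreement figures can be illustrated with a small sketch. It assumes a multi-rater statistic such as Fleiss' kappa (the abstract does not say which kappa variant the trial used) and maps values to labels with the commonly used Landis and Koch bands, under which 0.313 reads as "fair" and 0.421 as "moderate", matching the abstract's wording. `fleiss_kappa`, `agreement_band`, and the toy counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same rater count.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed pairwise agreement per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_band(kappa):
    """Landis and Koch verbal labels; ASSUMED to be the scale behind the abstract's wording."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy example: 3 readers rate 2 nodules as (benign, malignant).
print(fleiss_kappa([[3, 0], [0, 3]]))  # full agreement
print(fleiss_kappa([[2, 1], [1, 2]]))  # split opinions
print(agreement_band(0.313), "->", agreement_band(0.421))
```

The kappa values passed to `agreement_band` are the two reported in the abstract; the count tables are made up purely to exercise the formula.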

Keywords: DeepFAN, transformer, lung nodules, CT scans, clinical trial, radiologist assistance, diagnostic performance, explainability analysis
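The abstract reports inter-reader agreement as an overall kappa (0.313 vs. 0.421, "fair" to "moderate") without saying which kappa statistic was used. As a rough illustration only, the sketch below assumes Fleiss' kappa, a common choice for more than two raters, and the Landis-Koch verbal scale; the function names and the toy counts are hypothetical and not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (every row must sum to the same n)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Verbal label on the Landis-Koch scale (an assumed convention here)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these assumptions, `landis_koch(0.313)` gives "fair" and `landis_koch(0.421)` gives "moderate", which matches the wording of the abstract.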

43. ❌ Are LLMs Overkill for Databases?: A Study on the Finiteness of SQL

Authors: Yue Li, David Mimno, Unso Eun Seo Jo | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

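Scoring rationale: The paper studies the limits of applying LLMs to SQL generation and finds that real-world SQL query complexity is bounded and highly predictable. It is therefore highly relevant only to the keyword 'Large Language Models OR LLMs OR Foundation Models' (8 points), since its core question is whether LLMs are suited to database access. The remaining keywords cover specific techniques (e.g., MoE, quantization, inference acceleration), training methods (e.g., pre-training, fine-tuning, alignment), or particular application areas (e.g., AI for science, agent systems), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The paper asks whether LLMs are necessary for natural-language-to-SQL translation. An analysis of 376 databases shows that SQL query complexity is bounded and highly predictable, suggesting that LLMs may be overkill in this domain and that template-based methods could be safer, cheaper, and auditable.

Abstract (Chinese translation)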

得益于代码生成大语言模型，将自然语言翻译为SQL以进行数据检索已变得更加便捷。但生成SQL代码究竟有多困难？尽管数据库的复杂度可能趋于无限，但查询的复杂度实际上受现实应用需求与人类使用场景的约束。通过对376个数据库样本的分析，我们发现作为自然语言问题翻译的SQL查询在实际复杂度上是有限的。数据库表数量的增加与SQL查询复杂度的提升之间并不存在明确的单调关系。从模板形式来看，SQL查询遵循类似幂律的频率分布：在我们测试的查询中，仅需覆盖13%的模板类型即可满足70%的查询需求，这表明绝大多数SQL查询具有可预测性。这意味着，尽管代码生成大语言模型具有实用价值，但在数据库访问领域，它们可能仅在一个狭窄且高度程式化的空间中运作，而使用模板可能更为安全、经济且具备可审计性。

Abstract

Translating natural language to SQL for data retrieval has become more accessible thanks to code generation LLMs. But how hard is it to generate SQL code? While databases can become unbounded in complexity, the complexity of queries is bounded by real life utility and human needs. With a sample of 376 databases, we show that SQL queries, as translations of natural language questions are finite in practical complexity. There is no clear monotonic relationship between increases in database table count and increases in complexity of SQL queries. In their template forms, SQL queries follow a Power Law-like distribution of frequency where 70% of our tested queries can be covered with just 13% of all template types, indicating that the high majority of SQL queries are predictable. This suggests that while LLMs for code generation can be useful, in the domain of database access, they may be operating in a narrow, highly formulaic space where templates could be safer, cheaper, and auditable.

Keywords: LLMs, SQL generation, database access, query complexity, natural language to SQL, code generation, template-based approach, predictable queries
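The coverage claim (70% of queries covered by 13% of template types) can be reproduced on any query log once queries are reduced to templates. The sketch below is a minimal illustration under an assumed, deliberately crude templating rule (mask string and numeric literals, normalize whitespace and case); the abstract does not describe the paper's actual templating procedure, and `to_template` and `templates_needed` are hypothetical names.

```python
import re
from collections import Counter


def to_template(sql):
    """Crude templating (an assumption, not the paper's rule): mask literals
    so queries that differ only in constants collapse to the same template."""
    s = sql.strip().rstrip(";")
    s = re.sub(r"'[^']*'", "<STR>", s)           # string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", s)   # numeric literals
    s = re.sub(r"\s+", " ", s)                   # collapse whitespace
    return s.upper()


def templates_needed(queries, target=0.70):
    """Return (k, n_templates, coverage), where k is the smallest number of
    most frequent templates whose queries make up at least `target` of all
    queries."""
    counts = Counter(to_template(q) for q in queries)
    total = sum(counts.values())
    covered = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return k, len(counts), covered / total
    return len(counts), len(counts), 1.0
```

The ratio `k / n_templates` is the quantity the abstract reports as 13% at a 70% coverage target; a small ratio indicates a heavy-headed, power-law-like template distribution.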

44. ❌ TAAC: A gate into Trustable Audio Affective Computing

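Authors: Xintao Hu, Feng-Qi Cui | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0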
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

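Scoring rationale: The paper focuses on audio affective computing and depression diagnosis, proposing TAAC, a privacy-preserving framework based on adversarial-loss subspace decomposition. Although it is an application of AI in medicine, it involves neither large models, nor innovations in deep learning principles, nor any specific technique among the scoring keywords (e.g., LLMs, MoE, scaling laws). It is only loosely related to 'AI for Science' (applied to medical diagnosis), and that is not its core content, so this keyword receives 5 points and all others 0.

!!! tip deepseek-chat TL;DR

The paper proposes TAAC, a trustable audio affective computing framework that uses adversarial-loss subspace decomposition to detect depression automatically from audio while protecting the user's identity privacy.

Abstract (Chinese translation)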

随着人工智能技术在抑郁症诊断领域的应用，抑郁症筛查的高需求与有限供给之间的矛盾已得到显著缓解。在各种模态数据中，基于音频的抑郁症诊断受到学术界与工业界日益增长的关注，因为音频是情感传递最常见的载体。然而，音频数据同时包含用户敏感身份信息（User-sensitive Identity Information, ID），这类信息极为脆弱，在智能诊断过程中可能被恶意利用。在现有方法中，抑郁症特征与敏感特征的有效区分始终是一个难点。同时，引入仅对敏感特征进行加密的安全加密方法，以及构建能够准确诊断抑郁症的强大分类器，对该问题的解决至关重要。为应对这些挑战，我们利用基于对抗损失的子空间分解技术，首次提出了一个实用框架——可信音频情感计算（Trustable Audio Affective Computing, TAAC），旨在可信环境中通过音频实现自动化抑郁症检测。TAAC的核心组件包括：用于特征分解的差异化特征子空间分解器（Differentiating Features Subspace Decompositor, DFSD）、用于身份信息加密的灵活噪声加密器（Flexible Noise Encryptor, FNE），以及用于性能提升的分阶段训练范式。通过与现有加密方法的广泛实验对比，本框架在抑郁症检测、身份信息保留和音频重建方面均表现出卓越性能。同时，多场景实验验证了模型在不同加密强度下的稳定性。这证明了本框架在机密性、准确性、可追溯性和可调节性方面的优越性。

Abstract

With the emergence of AI techniques for depression diagnosis, the conflict between high demand and limited supply for depression screening has been significantly alleviated. Among various modal data, audio-based depression diagnosis has received increasing attention from both academia and industry since audio is the most common carrier of emotion transmission. Unfortunately, audio data also contains User-sensitive Identity Information (ID), which is extremely vulnerable and may be maliciously used during the smart diagnosis process. Among previous methods, the clarification between depression features and sensitive features has always serve as a barrier. It is also critical to the problem for introducing a safe encryption methodology that only encrypts the sensitive features and a powerful classifier that can correctly diagnose the depression. To track these challenges, by leveraging adversarial loss-based Subspace Decomposition, we propose a first practical framework, TAAC, presented for Trustable Audio Affective Computing, to perform automated depression detection through audio within a trustable environment. The key enablers of TAAC are Differentiating Features Subspace Decompositor (DFSD), Flexible Noise Encryptor (FNE) and Staged Training Paradigm, used for decomposition, ID encryption and performance enhancement, respectively. Extensive experiments with existing encryption methods demonstrate our framework’s preeminent performance in depression detection, ID reservation and audio reconstruction. Meanwhile, the experiments across various setting demonstrates our model’s stability under different encryption strengths. Thus proving our framework’s excellence in Confidentiality, Accuracy, Traceability, and Adjustability.

Keywords: Trustable Audio Affective Computing, depression diagnosis, audio-based detection, privacy protection, adversarial loss, subspace decomposition, sensitive feature encryption, medical AI

45. ❌ Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

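Scoring rationale: The paper's core topic is on-policy distillation (OPD) in large language model (LLM) post-training, which directly matches the 'Large Language Models' and 'Post-training' keywords (10 points each). Its experiments cover math reasoning and agentic tasks, giving some relevance to 'Chain of Thought' and 'LLM Agents' (5 points each). Other keywords such as MoE, quantization, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper examines the failure modes of sampled-token on-policy distillation (OPD) in long-horizon tasks and proposes teacher top-K local support matching (truncated reverse-KL with top-p rollout sampling) to stabilize optimization and improve downstream performance.

Abstract (Chinese translation)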

在线蒸馏（On-policy distillation, OPD）因其基于学生模型生成的轨迹而非固定的教师模型轨迹来评估教师反馈，在大语言模型（LLM）的后训练中颇具吸引力。然而，在长序列任务中，常用的基于采样令牌的变体（sampled-token variant）是脆弱的：它将分布匹配简化为单令牌信号，并且随着生成轨迹逐渐偏离教师模型常访问的前缀，其可靠性会不断下降。我们从估计器与实现两个角度重新审视在线蒸馏。理论上，令牌级别的在线蒸馏相对于序列级别的反向KL散度是有偏的，但其最坏情况下的方差边界更紧；我们的模拟研究在实证中显示了同样的权衡——未来奖励的耦合越强，梯度方差越大，学习过程越不稳定。实证上，我们识别了采样令牌在线蒸馏的三种失效模式：不平衡的单令牌信号、对学生生成前缀的教师指导不可靠，以及分词器或特殊令牌不匹配导致的失真。我们通过教师模型前K项局部支持匹配来解决这些问题，具体实现为带截断的反向KL散度，结合基于top-p的轨迹采样和特殊令牌掩码。在单任务数学推理和多任务（智能体任务与数学任务）训练中，该目标相比采样令牌在线蒸馏能带来更稳定的优化和更优的下游性能。

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.

Keywords: On-policy distillation, Large language models, Post-training, Reverse-KL, Math reasoning, Agentic training, Teacher-student learning, Rollout sampling
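The abstract describes the proposed objective only as "truncated reverse-KL" over the teacher's top-K local support with special-token masking. One plausible reading is sketched below in NumPy: at each position, restrict both distributions to the teacher's K most likely tokens, renormalize, and compute KL(student || teacher). Whether the paper renormalizes in this way is an assumption, and `topk_reverse_kl` and `special_ids` are hypothetical names.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def topk_reverse_kl(student_logits, teacher_logits, k, special_ids=()):
    """Per-position reverse KL(student || teacher) on the teacher's top-k
    tokens. Inputs have shape (T, V); the result has shape (T,).
    Assumption: both distributions are renormalized over the k-token support."""
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    rank = t.copy()
    if special_ids:
        # Special tokens are never selected into the support.
        rank[:, list(special_ids)] = -np.inf
    idx = np.argsort(-rank, axis=-1)[:, :k]       # teacher top-k token ids
    s_k = np.take_along_axis(s, idx, axis=-1)
    t_k = np.take_along_axis(t, idx, axis=-1)
    log_ps = log_softmax(s_k)
    log_pt = log_softmax(t_k)
    return (np.exp(log_ps) * (log_ps - log_pt)).sum(axis=-1)
```

Averaging the returned vector over the positions of a student-generated rollout would give a scalar training loss. Unlike the sampled-token variant, which uses only the one token the student happened to sample, every position contributes a signal over K tokens.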

46. ❌ Voxtral TTS

Authors: Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathieu Schmitt, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Mikhail Biriuchinskii, Minh-Quang Pham, Mircea Lica, Morgane Rivière, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Randall Isenhour, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Belkadi, Sandeep Subramanian, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Tarun Kumar Vangani, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Vedant Nanda, Victor Jouault, Vincent Maladière, Vincent Pfister, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

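Scoring rationale: Voxtral TTS is a text-to-speech (TTS) paper. It uses a hybrid architecture (auto-regressive generation of semantic speech tokens plus flow-matching for acoustic tokens) and a custom speech tokenizer (Voxtral Codec), and belongs to the speech generation field. All scoring keywords target large language models (LLMs) and related techniques (e.g., training, alignment, reasoning, agents, scientific applications); the paper involves neither LLMs, nor innovations in deep learning principles, nor applications of large models in other domains, so it is unrelated to every keyword and scores 0 throughout.

!!! tip deepseek-chat TL;DR

Voxtral TTS is a multilingual text-to-speech model that combines auto-regressive generation with flow-matching, generates natural and expressive speech from only 3 seconds of reference audio, and is preferred over ElevenLabs Flash v2.5 in human evaluations (68.4% win rate).

Abstract (Chinese translation)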

我们推出Voxtral TTS，这是一个富有表现力的多语言文本转语音模型，能够仅通过3秒参考音频生成自然语音。Voxtral TTS采用混合架构，结合了语义语音标记的自回归生成与声学标记的流匹配技术。这些标记通过Voxtral Codec进行编码和解码——这是一个基于混合VQ-FSQ量化方案从头训练的语音标记器。在母语者进行的人工评估中，Voxtral TTS因其自然度和表现力在多语言语音克隆任务中更受青睐，相较于ElevenLabs Flash v2.5模型获得了68.4%的胜率。我们以CC BY-NC许可协议公开发布模型权重。

Abstract

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

Keywords: Voxtral TTS, text-to-speech, multilingual, voice cloning, auto-regressive generation, flow-matching, speech tokenizer, VQ-FSQ quantization

47. ❌ CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

Authors: Alex Hoi Hang Chan, Neha Singhal, Onur Kocahan, Andrea Meltzer, Saverio Lubrano, Miyako H. Warrington, Michel Griesser, Fumihiro Kano, Hemal Naik | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

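Scoring rationale: The paper applies computer vision to behavioral monitoring of wild animals, in particular individual re-identification and behavior analysis of birds. It covers dataset construction (CHIRP), a computer vision method (the CORVID pipeline), and evaluation with biological metrics, making it an application of AI to biology and ecology. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" bears some relation to the topic (AI applied to science); the paper involves neither large models, nor innovations in deep learning principles, nor any other keyword-related technique (e.g., LLMs, MoE, training methods, inference optimization). Hence the last keyword scores 5 and all others 0.

!!! tip deepseek-chat TL;DR

The study addresses the challenge of long-term, individual-level behavioral monitoring of wild birds. It builds the CHIRP dataset and develops the CORVID individual re-identification method, enabling automated behavioral analysis of Siberian jays, and shows the method's superior performance on biologically relevant metrics.

Abstract (Chinese translation)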

对个体动物进行长期行为监测对于研究不同时间尺度上发生的行为变化至关重要，尤其在保护生物学和进化生物学领域。计算机视觉方法已被证明有利于生物多样性监测，但对野生种群进行自动化行为监测仍具挑战性。这主要源于缺乏能够覆盖一系列计算机视觉任务的数据集，而这些任务对于提取具有生物学意义的个体动物测量数据是必需的。为此，我们引入了一个新的数据集（CHIRP）及一种新方法（CORVID），用于野生鸟类的个体重识别。CHIRP（结合行为、个体重识别与姿态的数据集）数据集整理自瑞典拉普兰地区长期研究的野生西伯利亚松鸦种群，支持重识别（re-id）、行为识别、2D关键点估计、目标检测和实例分割等任务。除了传统的针对特定任务的基准测试外，我们还引入了基于生物学相关指标（如取食率、共现率）的应用导向型基准测试，以评估模型在真实世界用例中的性能。最后，我们提出了CORVID（基于颜色的视频重识别），这是一种基于彩色脚环分割与分类的新型流程，用于鸟类个体识别——彩色脚环是视觉识别个体鸟类的广泛应用方法。CORVID通过将检测到的彩色脚环组合与数据库进行匹配，提供了一种基于概率的身份追踪方法。我们利用应用导向型基准测试表明，CORVID的性能优于当前最先进的重识别方法。我们希望这项工作能为学界提供一个蓝图，指导如何从符合伦理规范的生物学研究中整理真实世界数据集，以弥合计算机视觉研究与生物学应用之间的鸿沟。

Abstract

Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.

Keywords: behavioral monitoring, individual re-identification, computer vision, wild birds, dataset curation, action recognition, biological applications, conservation biology
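The abstract says CORVID matches the detected combination of color rings against a database to obtain a probability-based identity. It gives no formula, so the following is only a guess at the general idea: score each registered bird by the product of the classifier's probabilities for its ring colors, then normalize under a uniform prior. The data layout, the independence assumption across rings, and the name `match_individual` are all hypothetical.

```python
import math


def match_individual(ring_probs, database):
    """Score registered birds against one detection.

    ring_probs: list of dicts mapping color -> probability, one dict per
                ring position (hypothetical output of a ring classifier).
    database:   dict mapping bird id -> tuple of ring colors, in the same
                position order as ring_probs.
    Returns a dict mapping bird id -> normalized score, assuming independent
    rings and a uniform prior over registered birds.
    """
    scores = {
        bird_id: math.prod(p.get(color, 0.0)
                           for p, color in zip(ring_probs, colors))
        for bird_id, colors in database.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No registered combination is compatible with the detection.
        return {bird_id: 0.0 for bird_id in scores}
    return {bird_id: s / total for bird_id, s in scores.items()}
```

For example, with two birds registered as ("red", "blue") and ("red", "green") and a detection that favors red then blue, the first bird receives the larger share of the probability mass. Tracking an identity over a video would additionally require accumulating such scores across frames, which this sketch does not attempt.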

48. ❌ Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

Authors: Koldo Basterretxea, Jon Gutiérrez-Zaballa, Javier Echanobe | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

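Scoring rationale: The paper focuses on the challenges of applying hyperspectral imaging (HSI) to autonomous driving, including sensor technology selection, algorithm development, and real-time processing, all of which are computer vision problems. It does not touch on large language models, deep learning principles, model training optimization, inference acceleration, AI alignment, agent systems, or AI-for-science applications. All keywords relate directly to large models and deep learning techniques, whereas this is purely a computer vision and sensor application study, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyzes the technical challenges of hyperspectral imaging in autonomous driving, including lighting conditions, dynamic scenes, and real-time processing constraints, and evaluates related vision algorithms on the HSI-Drive dataset.

Abstract (Chinese translation)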

高光谱成像（HSI）在自动驾驶（AD）领域的应用虽前景广阔，却面临着与该应用领域特性及需求相关的诸多挑战。一方面，存在非受控且多变的照明条件、大范围的景深跨度以及包含高速运动物体的动态场景；另一方面，则需满足实时操作要求，并受限于嵌入式平台有限的计算资源。这些因素共同决定了选择合适高光谱成像技术的标准，也推动了利用传感器获取的光谱与空间信息开发定制化视觉算法的进程。本文以基于最新版HSI-Drive数据集实验所得结果为例，分析了在面向自动驾驶的高光谱视觉系统研究中探索的若干技术。

Abstract

The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, non-controlled and variable lighting conditions, the wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.

Keywords: hyperspectral imaging, autonomous driving, HSI-Drive dataset, real-time processing, vision algorithms, sensor technology, dynamic scenes, computational resources

49. ❌ NERO-Net: A Neuroevolutionary Approach for the Design of Adversarially Robust CNNs

Authors: Inês Valentim, Nuno Antunes, Nuno Lourenço | Source: arXiv | Published: 2026-03-26 | arXiv link: http://arxiv.org/abs/2603.25517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

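Scoring rationale: The paper "NERO-Net: A Neuroevolutionary Approach for the Design of Adversarially Robust CNNs" focuses on using neuroevolution to design adversarially robust convolutional neural networks (CNNs); its field is computer vision and adversarial machine learning. All scoring keywords revolve around large language models (LLMs) and related techniques (e.g., training methods, inference optimization, applications), whereas this paper studies CNN architecture design and has no direct connection to LLMs. Every keyword therefore scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes NERO-Net, a neuroevolutionary method for designing convolutional neural networks with intrinsic adversarial robustness. Experiments on CIFAR-10 show that the resulting models improve performance under adversarial attack while retaining high clean accuracy.

Abstract (Chinese translation)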

神经进化技术自动化了神经网络设计的复杂任务，但往往忽视了进化模型固有的对抗脆弱性，这阻碍了其在安全关键场景中的应用。尽管鲁棒性训练方法已受到广泛关注，但具有内在鲁棒性的架构设计在很大程度上仍未得到探索。本文提出NERO-Net，一种通过神经进化设计卷积神经网络的方法，旨在使其更好地抵御对抗攻击。我们的搜索策略通过在进化循环中避免对抗训练，从而隔离了架构对鲁棒性的影响。因此，我们的适应度函数鼓励那些即使采用标准（非鲁棒）方法训练，也能在不牺牲干净样本准确率的前提下获得较高攻击后准确率的候选架构。我们在CIFAR-10数据集上评估NERO-Net，特别关注$L_\infty$-鲁棒性。具体而言，进化搜索中出现的最优个体在对抗FGSM攻击（在搜索阶段用作鲁棒性的高效估计器）时达到了33%的准确率，同时保持了87%的干净样本准确率。对该个体进行进一步的标准训练后，这些指标提升至对抗准确率47%和干净准确率93%，表明其具有内在的架构鲁棒性。对抗训练则使该模型在对抗AutoAttack时的整体准确率达到40%。

Abstract

Neuroevolution automates the complex task of neural network design but often ignores the inherent adversarial fragility of evolved models which is a barrier to adoption in safety-critical scenarios. While robust training methods have received significant attention, the design of architectures exhibiting intrinsic robustness remains largely unexplored. In this paper, we propose NERO-Net, a neuroevolutionary approach to design convolutional neural networks better equipped to resist adversarial attacks. Our search strategy isolates architectural influence on robustness by avoiding adversarial training during the evolutionary loop. As such, our fitness function promotes candidates that, even trained with standard (non-robust) methods, achieve high post-attack accuracy without sacrificing the accuracy on clean samples. We assess NERO-Net on CIFAR-10 with a specific focus on $L_\infty$-robustness. In particular, the fittest individual emerged from evolutionary search with 33% accuracy against FGSM, used as an efficient estimator for robustness during the search phase, while maintaining 87% clean accuracy. Further standard training of this individual boosted these metrics to 47% adversarial and 93% clean accuracy, suggesting inherent architectural robustness. Adversarial training brings the overall accuracy of the model up to 40% against AutoAttack.

Keywords: Neuroevolution, Adversarial Robustness, Convolutional Neural Networks, Architecture Design, FGSM, AutoAttack, CIFAR-10
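
The abstract uses FGSM as a cheap robustness estimator during the search. As a rough illustration only, the sketch below applies the standard one-step FGSM update, x_adv = x + eps * sign(dL/dx), to a toy logistic-regression classifier and reports clean versus post-attack accuracy. The classifier, the data, and the helper names (`fgsm`, `predict`, `accuracy`) are assumptions made for this example; they are not NERO-Net's architecture or fitness function.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One-step L_inf attack: x_adv = x + eps * sign(d loss / d x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Gradient of binary cross-entropy w.r.t. the input of a logistic model.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def predict(x, w, b) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def accuracy(data, w, b, eps=0.0) -> float:
    """Accuracy on (x, y) pairs; eps > 0 attacks each input with FGSM first."""
    hits = 0
    for x, y in data:
        x_eval = fgsm(x, y, w, b, eps) if eps > 0 else x
        hits += predict(x_eval, w, b) == y
    return hits / len(data)


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0                      # toy model (assumption)
    data = [([0.3, 0.1], 1), ([-1.0, 0.5], 0)]   # toy dataset (assumption)
    print("clean accuracy:", accuracy(data, w, b))
    print("post-attack accuracy:", accuracy(data, w, b, eps=0.3))
```

A search loop in this spirit would score each candidate by both numbers, so that robustness gains are not bought with clean-accuracy losses.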

50. ❌ Lightweight GenAI for Network Traffic Synthesis: Fidelity, Augmentation, and Classification

Authors: Giampaolo Bovenzi, Domenico Ciuonzo, Jonatan Krolikowski, Antonio Montieri, Alfredo Nascita, Antonio Pescapè, Dario Rossi Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25507v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

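Scoring rationale: The paper studies network traffic generation and classification using lightweight generative AI (transformer-based, state-space, and diffusion models). It does not involve large language models (LLMs) or innovations in core deep learning techniques, and it is not applied to a scientific domain such as bioinformatics. The keywords all relate directly to LLM techniques, AI for science, or deep learning principles, whereas this paper is a domain-specific generative AI application with no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper studies network traffic synthesis with lightweight generative AI models (transformer-based, state-space, and diffusion models) to address data scarcity. Experiments show that these models generate high-fidelity traffic and improve classification in low-data settings by up to 40%, with modest computational overhead.

Abstract (Chinese translation)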

精确的网络流量分类（NTC）日益受到有限标注数据和严格隐私要求的制约。虽然网络流量生成（NTG）为缓解数据稀缺提供了有效手段，但传统生成方法难以建模现代流量的复杂时序动态，且/或通常产生高昂的计算成本。本文采用轻量级生成式人工智能（GenAI）架构来解决NTG任务，包括基于Transformer、状态空间和扩散模型的设计，旨在实现实际部署。我们从四个维度进行了系统性评估：（i）（合成）流量保真度，（ii）纯合成数据训练，（iii）低数据条件下的数据增强，以及（iv）计算效率。在两个异构数据集上的实验表明，轻量级GenAI模型能同时保持静态和时序流量特征，其中Transformer和状态空间模型在一整套保真度指标上均与真实分布高度吻合。仅使用合成流量训练的分类器在真实数据上达到了最高87%的F1分数。在低数据场景下，GenAI驱动的增强将NTC性能提升高达+40%，显著缩小了与全数据训练的差距。总体而言，基于Transformer的模型在保真度与效率之间提供了最佳平衡，能够以适度的计算开销实现高质量、隐私保护的流量合成。

Abstract

Accurate Network Traffic Classification (NTC) is increasingly constrained by limited labeled data and strict privacy requirements. While Network Traffic Generation (NTG) provides an effective means to mitigate data scarcity, conventional generative methods struggle to model the complex temporal dynamics of modern traffic or/and often incur significant computational cost. In this article, we address the NTG task using lightweight Generative Artificial Intelligence (GenAI) architectures, including transformer-based, state-space, and diffusion models designed for practical deployment. We conduct a systematic evaluation along four axes: (i) (synthetic) traffic fidelity, (ii) synthetic-only training, (iii) data augmentation under low-data regimes, and (iv) computational efficiency. Experiments on two heterogeneous datasets show that lightweight GenAI models preserve both static and temporal traffic characteristics, with transformer and state-space models closely matching real distributions across a complete set of fidelity metrics. Classifiers trained solely on synthetic traffic achieve up to 87% F1-score on real data. In low-data settings, GenAI-driven augmentation improves NTC performance by up to +40%, substantially reducing the gap with full-data training. Overall, transformer-based models provide the best trade-off between fidelity and efficiency, enabling high-quality, privacy-aware traffic synthesis with modest computational overhead.

Keywords: Network Traffic Classification, Network Traffic Generation, Generative AI, Transformer models, State-space models, Data augmentation, Computational efficiency, Privacy-aware synthesis

51. ❌ EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents

Authors: Linxiao Li, Zhixiang Lu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25498v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

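Scoring rationale: The paper's core topic is LLM inference efficiency. It relates directly to LLMs (10), Chain of Thought (10), and LLM Agents (10), because it focuses on optimizing the inference process of LLM agents, in particular on reducing unnecessary CoT computation. System 2 Thinking (5) and Speculative Decoding (5) are partially relevant, since the paper touches on deep reasoning and inference acceleration, but neither is central. Other keywords such as MoE, SLMs, and RAG are not mentioned in the abstract and are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper targets wasted energy in LLM agent inference and proposes EcoThink, an adaptive inference framework that dynamically assesses query complexity and skips unnecessary deep reasoning, reducing inference energy by 40.4% on average without statistically significant performance loss.

Abstract (Chinese translation)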

随着网络从静态检索向生成式交互转型，大型语言模型日益增长的环境足迹构成了严峻的可持续性挑战。当前范式不加区分地对每日数十亿次查询应用思维链等计算密集型策略，导致模型过度思考——这种冗余不仅加剧碳排放，也抬高了运营门槛。此种低效模式直接阻碍资源受限地区公平获取人工智能服务，从而影响联合国可持续发展目标13（气候行动）与目标10（减少不平等）的实现。为此，我们提出EcoThink：一种能源感知的自适应推理框架，旨在协调高性能AI智能与环境责任。该框架采用基于蒸馏的轻量级路由机制，动态评估查询复杂度，对事实检索类查询跳过冗余推理，同时为复杂逻辑任务保留深度计算。在9个多样化基准测试中的广泛评估表明，EcoThink平均降低40.4%的推理能耗（网络知识检索任务最高可达81.9%），且未出现统计学显著的性能损失。通过减少算法浪费，EcoThink为构建可持续、包容且高能效的生成式AI智能体提供了可扩展路径。

Abstract

As the Web transitions from static retrieval to generative interaction, the escalating environmental footprint of Large Language Models (LLMs) presents a critical sustainability challenge. Current paradigms indiscriminately apply computation-intensive strategies like Chain-of-Thought (CoT) to billions of daily queries, causing LLM overthinking, a redundancy that amplifies carbon emissions and operational barriers. This inefficiency directly undermines UN Sustainable Development Goals 13 (Climate Action) and 10 (Reduced Inequalities) by hindering equitable AI access in resource-constrained regions. To address this, we introduce EcoThink, an energy-aware adaptive inference framework designed to reconcile high-performance AI intelligence with environmental responsibility. EcoThink employs a lightweight, distillation-based router to dynamically assess query complexity, skipping unnecessary reasoning for factoid retrieval while reserving deep computation for complex logic. Extensive evaluations across 9 diverse benchmarks demonstrate that EcoThink reduces inference energy by 40.4% on average (up to 81.9% for web knowledge retrieval) without statistically significant performance loss. By mitigating algorithmic waste, EcoThink offers a scalable path toward a sustainable, inclusive, and energy-efficient generative AI Agent.

Keywords: Large Language Models, Chain-of-Thought, LLM Agents, Adaptive Inference, Energy Efficiency, Sustainable AI, Inference Optimization, Green Computing
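
The routing idea in the abstract (skip chain-of-thought for simple factoid queries, keep it for complex ones) can be pictured with a toy threshold router. Everything below is a hypothetical stand-in: EcoThink's router is distillation-based, whereas `complexity_score` here is a keyword heuristic invented for illustration, and the energy costs are arbitrary units rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Route:
    mode: str      # "direct" or "cot"
    energy: float  # hypothetical energy units, not measured values


# Hypothetical cue words standing in for a learned complexity estimate.
CUES = {"why", "prove", "derive", "compare", "plan", "explain"}


def complexity_score(query: str) -> float:
    """Fraction of words that look like reasoning cues (toy heuristic)."""
    words = [w.strip("?.,!") for w in query.lower().split()]
    return sum(w in CUES for w in words) / max(len(words), 1)


def route(query: str, threshold: float = 0.1,
          direct_cost: float = 1.0, cot_cost: float = 5.0) -> Route:
    """Send complex queries to chain-of-thought, the rest to a direct answer."""
    if complexity_score(query) >= threshold:
        return Route("cot", cot_cost)
    return Route("direct", direct_cost)


def energy_saving(queries, threshold: float = 0.1,
                  direct_cost: float = 1.0, cot_cost: float = 5.0) -> float:
    """Relative saving versus applying chain-of-thought to every query."""
    used = sum(route(q, threshold, direct_cost, cot_cost).energy for q in queries)
    return 1.0 - used / (cot_cost * len(queries))


if __name__ == "__main__":
    qs = ["What is the capital of France?",
          "Explain why the proof fails and derive a fix"]
    for q in qs:
        print(route(q).mode, "<-", q)
    print("relative saving:", energy_saving(qs))
```

The saving depends entirely on the query mix and the assumed cost ratio, which is why the paper reports it per benchmark rather than as a single constant.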

52. ❌ Retraining as Approximate Bayesian Inference

Authors: Harrison Katz Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25480v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

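Scoring rationale: The paper presents a decision-theoretic framework for model retraining, treating retraining as approximate Bayesian inference under computational constraints and introducing the notion of “learning debt”. Although it concerns machine learning model maintenance, it does not specifically discuss large models, core deep learning techniques, or scientific applications, and it involves none of the specific techniques in the scoring keywords (LLM, MoE, SFT, RAG, and so on). Its content is closer to general decision theory for model maintenance and has no direct connection to the keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a decision-theoretic framework that treats model retraining as approximate Bayesian inference under computational constraints and derives auditable, evidence-based retraining triggers from the loss function.

Abstract (Chinese translation)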

模型重训练通常被视为一项持续性维护任务。但正如哈里森·卡茨当前所论证的，在计算约束条件下，重训练可被更准确地理解为近似贝叶斯推断。持续更新的信念状态与已冻结的部署模型之间的差距即“学习债务”，而重训练决策则是一个成本最小化问题，其阈值由损失函数推导得出。本文中，卡茨提出了一个用于制定重训练策略的决策理论框架。该框架可生成基于证据的触发机制，以取代传统的时间计划表，并使治理过程具备可审计性。针对不熟悉贝叶斯与决策理论术语的读者，文末附有核心词汇表以供参考。

Abstract

Model retraining is usually treated as an ongoing maintenance task. But as Harrison Katz now argues, retraining can be better understood as approximate Bayesian inference under computational constraints. The gap between a continuously updated belief state and your frozen deployed model is “learning debt,” and the retraining decision is a cost minimization problem with a threshold that falls out of your loss function. In this article Katz provides a decision-theoretic framework for retraining policies. The result is evidence-based triggers that replace calendar schedules and make governance auditable. For readers less familiar with the Bayesian and decision-theoretic language, key terms are defined in a glossary at the end of the article.

Keywords: model retraining, approximate Bayesian inference, decision-theoretic framework, learning debt, retraining policies, evidence-based triggers, computational constraints, cost minimization
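
To make the "learning debt" idea concrete, here is a toy sketch in which a Beta posterior keeps updating on new observations while the deployed model stays frozen at the rate it was trained on. The gap between the two plays the role of learning debt, and retraining fires once the expected loss from that gap exceeds the retraining cost. The linear loss, the parameter names, and the Beta-Bernoulli setup are all assumptions chosen for illustration; the article's actual formulation may differ.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta posterior over an event rate, updated online (conjugate update)."""
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, event: bool) -> None:
        if event:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def learning_debt(belief: BetaBelief, frozen_rate: float) -> float:
    """Gap between the current belief and the frozen deployed estimate."""
    return abs(belief.mean - frozen_rate)


def should_retrain(belief: BetaBelief, frozen_rate: float,
                   loss_per_unit_gap: float, horizon: int,
                   retrain_cost: float) -> bool:
    """Retrain when the expected loss over the horizon exceeds the retraining cost.

    Assumes, hypothetically, that loss grows linearly with the gap.
    """
    expected_loss = learning_debt(belief, frozen_rate) * loss_per_unit_gap * horizon
    return expected_loss > retrain_cost


if __name__ == "__main__":
    belief = BetaBelief(alpha=2.0, beta=8.0)  # belief at deployment time
    frozen = belief.mean                       # deployed model is frozen here
    for _ in range(10):                        # the world drifts: events become common
        belief.update(True)
    print("debt:", learning_debt(belief, frozen))
    print("retrain:", should_retrain(belief, frozen, 10.0, 5, 15.0))
```

Because the trigger is a comparison of two logged quantities, each retraining decision can be audited after the fact, unlike a fixed calendar schedule.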

53. ❌ Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models

Authors: Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan, Sultan Alamri Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25495v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

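Scoring rationale: The paper studies short-term PM2.5 forecasting for urban air quality, comparing the conventional time-series models SARIMAX, Facebook Prophet, and NeuralProphet with a focus on forecast accuracy, computational efficiency, and interpretability. Its content falls entirely within conventional time-series analysis and environmental science and does not involve large language models, deep learning, AI for Science, or any of the other specified large-model techniques. The keywords all concern large-model principles, training methods, inference optimization, and AI applications, whereas this paper studies classical statistical models and lightweight forecasting methods, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study compares three lightweight, interpretable time-series models (SARIMAX, Facebook Prophet, and NeuralProphet) for hourly PM2.5 forecasting in Beijing. Under weekly walk-forward refitting, Facebook Prophet performed best (MAE 37.61, RMSE 50.10); under the frozen-model regime, online residual correction improved both SARIMAX and Facebook Prophet, with corrected SARIMAX giving the lowest overall error (MAE 32.50, RMSE 46.85).

Abstract (Chinese translation)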

精准的短期空气质量预报对于公共卫生保护与城市管理至关重要，然而当前许多预报框架依赖于复杂、数据密集且计算需求高的模型。本研究探讨了轻量级且可解释的预报方法能否在中国北京地区的小时级PM2.5预测中提供具有竞争力的性能。基于多年的污染物与气象时间序列数据，我们开发了一种防泄漏预报工作流程，该流程在完美预后（Perfect Prognosis）设定下结合了时序数据划分、预处理、特征选择和外生驱动因子建模。我们评估了三类预报模型族：SARIMAX、Facebook Prophet和NeuralProphet。为评估实际部署表现，模型在两种自适应机制下进行测试：每周向前滚动重拟合（walk-forward refitting）以及采用在线残差校正的冻结模型预报（frozen forecasting）。结果显示，各类模型在预测精度与计算效率上均存在明显差异。在向前滚动重拟合机制下，Facebook Prophet取得了最佳的综合性能，其平均绝对误差（MAE）为$37.61$，均方根误差（RMSE）为$50.10$，同时其所需的执行时间也远少于NeuralProphet。在冻结模型机制下，在线残差校正提升了Facebook Prophet和SARIMAX的预报效果，其中校正后的SARIMAX获得了最低的整体误差（MAE $32.50$；RMSE $46.85$）。NeuralProphet在两种机制下均表现欠佳，稳定性较低，且残差校正未能改善其预报结果。值得注意的是，校正后的Facebook Prophet达到了与其滚动重拟合版本几乎相同的误差水平，同时将运行时间从$15$分$21.91$秒大幅缩短至$46.60$秒。这些结果表明，轻量级的加法型预报策略在城市空气质量预测中仍能保持高度竞争力，在准确性、可解释性与计算效率之间提供了实用的平衡。

Abstract

Accurate short-term air-quality forecasting is essential for public health protection and urban management, yet many recent forecasting frameworks rely on complex, data-intensive, and computationally demanding models. This study investigates whether lightweight and interpretable forecasting approaches can provide competitive performance for hourly PM2.5 prediction in Beijing, China. Using multi-year pollutant and meteorological time-series data, we developed a leakage-aware forecasting workflow that combined chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under the Perfect Prognosis setting. Three forecasting families were evaluated: SARIMAX, Facebook Prophet, and NeuralProphet. To assess practical deployment behavior, the models were tested under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results showed clear differences in both predictive accuracy and computational efficiency. Under walk-forward refitting, Facebook Prophet achieved the strongest completed performance, with an MAE of $37.61$ and an RMSE of $50.10$, while also requiring substantially less execution time than NeuralProphet. In the frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding the lowest overall error (MAE $32.50$; RMSE $46.85$). NeuralProphet remained less accurate and less stable across both regimes, and residual correction did not improve its forecasts. Notably, corrected Facebook Prophet reached nearly the same error as its walk-forward counterpart while reducing runtime from $15$ min $21.91$ sec to $46.60$ sec. These findings show that lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering a practical balance between accuracy, interpretability, …

Keywords: PM2.5 forecasting, time-series models, SARIMAX, Facebook Prophet, NeuralProphet, air quality prediction, interpretable forecasting, computational efficiency
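
The frozen-model regime with online residual correction can be sketched as follows: the forecaster is never refitted, but each forecast is shifted by a running estimate of recent errors. The abstract does not specify the correction rule, so the exponentially weighted residual below (with smoothing factor `alpha`) is an assumption, as are the toy numbers. MAE and RMSE are the standard definitions.

```python
import math


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def residual_corrected(y_true, frozen_pred, alpha: float = 0.5):
    """Shift each frozen forecast by an exponentially weighted running residual.

    Only residuals from earlier time steps are used, so the correction at
    step t never sees the observation at step t (no leakage).
    """
    bias, corrected = 0.0, []
    for y, f in zip(y_true, frozen_pred):
        corrected.append(f + bias)
        bias = (1.0 - alpha) * bias + alpha * (y - f)
    return corrected


if __name__ == "__main__":
    observed = [50.0, 52.0, 54.0, 56.0, 58.0]  # toy PM2.5 values (assumption)
    frozen = [40.0, 42.0, 44.0, 46.0, 48.0]    # frozen model with a constant bias
    fixed = residual_corrected(observed, frozen)
    print("frozen    MAE/RMSE:", mae(observed, frozen), rmse(observed, frozen))
    print("corrected MAE/RMSE:", mae(observed, fixed), rmse(observed, fixed))
```

The correction costs one multiply-add per step, which is consistent with the large runtime gap the abstract reports between weekly refitting and the corrected frozen model.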

54. ❌ Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning

Authors: Jiajun Hu, Nuria Armengol Urpi, Jin Cheng, Stelian Coros Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25464v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

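Scoring rationale: The paper studies reinforcement learning (RL) algorithms, specifically zero-shot RL and robot control, focusing on behavior exploration, entropy maximization, and sim-to-real deployment. It does not involve large language models (LLMs), core deep learning techniques, or applications of AI in science, so none of the keywords is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes FB-MEBE, an online zero-shot reinforcement learning algorithm that promotes exploration by maximizing the entropy of the achieved behavior distribution and uses a regularization critic to steer policies toward more natural, physically plausible behaviors. It outperforms other exploration strategies on simulated downstream tasks and deploys to real robot hardware without further finetuning.

Abstract (Chinese translation)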

零样本强化学习（Zero-shot RL）算法旨在从无奖励数据集中学习一系列策略，并在测试时直接针对任意奖励函数恢复出最优策略。显然，预训练数据集的质量决定了所恢复策略在不同任务中的性能。然而，在缺乏对下游目标任务先验知识的情况下，预先收集相关且多样化的数据集仍是一个挑战。本研究基于前向-后向（Forward-Backward, FB）算法，针对真实机器人系统中的四足运动控制问题，探索在线（online）零样本强化学习方法。我们发现，无导向的探索会产生低多样性的数据，导致下游性能不佳，且所得策略难以直接部署于实际硬件。为此，我们提出FB-MEBE算法——一种结合无监督行为探索策略与正则化评判器的在线零样本强化学习方法。FB-MEBE通过最大化已实现行为分布的熵来促进探索。此外，正则化评判器引导恢复出的策略趋向更自然、物理上更合理的行为。我们通过实验证明，在一系列模拟下游任务中，FB-MEBE相比其他探索策略取得了更优的性能，并且能生成可直接部署于硬件、无需进一步微调的自然策略。相关视频与代码已发布于项目网站。

Abstract

Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study $\textit{online}$ zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it renders natural policies that can be seamlessly deployed to hardware without further finetuning. Videos and code available on our website.

Keywords: Zero-shot Reinforcement Learning, Online RL, Behavior Exploration, Maximum Entropy, Sim2Real, Quadrupedal Control, Forward-Backward Algorithm, Hardware Deployment
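
The phrase "maximizing the entropy of the achieved behavior distribution" can be illustrated with a discretized toy: count how often each behavior bin has been visited, then prefer the next behavior that raises the Shannon entropy of those counts the most. The bins, the greedy selection rule, and the function names are assumptions for this sketch; FB-MEBE itself works with learned behavior representations rather than hand-labeled bins.

```python
import math
from collections import Counter


def entropy(counts: Counter) -> float:
    """Shannon entropy (in nats) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def pick_behavior(counts: Counter, candidates):
    """Greedy choice: the candidate whose visit yields the highest entropy."""
    def entropy_after(behavior):
        trial = Counter(counts)
        trial[behavior] += 1
        return entropy(trial)
    return max(candidates, key=entropy_after)


if __name__ == "__main__":
    visited = Counter({"trot": 8, "walk": 2})  # toy behavior bins (assumption)
    choice = pick_behavior(visited, ["trot", "walk", "jump"])
    print("explore next:", choice)
```

Undirected exploration would keep sampling the already common bin, which is the low-diversity failure mode the abstract describes.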

55. ❌ Temporally Decoupled Diffusion Planning for Autonomous Driving

Authors: Xiang Li, Bikun Wang, John Zhang, Jianjun Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25462v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

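Scoring rationale: The paper focuses on motion planning for autonomous driving and proposes a diffusion-based trajectory generation method (TDDM). Although it contributes a deep learning innovation (diffusion models) in a specific application (autonomous driving), all of the given keywords explicitly target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on). The paper does not involve language models, text generation, or any of the specific LLM techniques named in the keywords. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the need for motion planning in autonomous driving to balance immediate safety with long-term goals. It proposes the Temporally Decoupled Diffusion Model (TDDM), which combines a noise-as-mask paradigm with Asymmetric Temporal Classifier-Free Guidance and approaches or exceeds state-of-the-art baselines on the nuPlan benchmark.

Abstract (Chinese translation)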

在动态城市环境中进行运动规划，需要在即时安全性与长期目标之间取得平衡。尽管扩散模型能有效捕捉多模态决策过程，但现有方法将轨迹视为单一整体，忽略了异质性的时间依赖性——近期规划受瞬时动力学约束，而远期规划则受导航目标主导。为解决这一问题，我们提出时间解耦扩散模型（Temporally Decoupled Diffusion Model, TDDM），通过噪声即掩码范式重构轨迹生成过程。该方法将轨迹划分为具有独立噪声水平的片段，隐式地将高噪声视为信息缺失区域，弱噪声作为上下文线索。这迫使模型通过利用与保留更完整的时间上下文之间的内部关联，来重建受损的近期状态。在架构层面，我们引入了时间解耦自适应层归一化（Temporally Decoupled Adaptive Layer Normalization, TD-AdaLN）以注入片段特定的时间步信息。在推理阶段，我们的非对称时间分类器自由引导机制利用弱噪声化的远期先验信息来指导即时路径生成。在nuPlan基准测试上的评估表明，TDDM达到或超越了当前最优基线模型，尤其在具有挑战性的Test14-hard子集中表现突出。

Abstract

Motion planning in dynamic urban environments requires balancing immediate safety with long-term goals. While diffusion models effectively capture multi-modal decision-making, existing approaches treat trajectories as monolithic entities, overlooking heterogeneous temporal dependencies where near-term plans are constrained by instantaneous dynamics and far-term plans by navigational goals. To address this, we propose Temporally Decoupled Diffusion Model (TDDM), which reformulates trajectory generation via a noise-as-mask paradigm. By partitioning trajectories into segments with independent noise levels, we implicitly treat high noise as information voids and weak noise as contextual cues. This compels the model to reconstruct corrupted near-term states by leveraging internal correlations with better-preserved temporal contexts. Architecturally, we introduce a Temporally Decoupled Adaptive Layer Normalization (TD-AdaLN) to inject segment-specific timesteps. During inference, our Asymmetric Temporal Classifier-Free Guidance utilizes weakly noised far-term priors to guide immediate path generation. Evaluations on the nuPlan benchmark show TDDM approaches or exceeds state-of-the-art baselines, particularly excelling in the challenging Test14-hard subset.

Keywords: autonomous driving, motion planning, diffusion models, trajectory generation, temporal decoupling, noise-as-mask, nuPlan benchmark, TDDM
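
The noise-as-mask idea (each trajectory segment gets its own noise level, so a heavily noised segment behaves like a masked region while a lightly noised one acts as context) can be sketched on a one-dimensional toy trajectory. The variance-preserving mixing rule, the scalar waypoints, and the function name `noise_segments` are assumptions for illustration; the paper's noise schedule and trajectory representation are not given in the abstract.

```python
import math
import random


def noise_segments(traj, seg_len: int, levels, rng: random.Random):
    """Corrupt each segment of a trajectory with its own noise level s in [0, 1].

    s = 0 leaves a waypoint untouched (context); s = 1 replaces it with pure
    Gaussian noise (an "information void" the model would have to reconstruct).
    """
    noisy = []
    for i, x in enumerate(traj):
        s = levels[i // seg_len]
        noisy.append(math.sqrt(1.0 - s * s) * x + s * rng.gauss(0.0, 1.0))
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    trajectory = [0.0, 1.0, 2.0, 3.0]  # toy 1-D waypoints (assumption)
    # Near-term segment fully noised, far-term segment kept clean.
    print(noise_segments(trajectory, seg_len=2, levels=[1.0, 0.0], rng=rng))
```

With levels `[1.0, 0.0]` the far-term waypoints pass through unchanged, mirroring how weakly noised far-term priors guide generation of the near-term path.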

56. ❌ Cross-Model Disagreement as a Label-Free Correctness Signal

Authors: Matt Gorbett, Suman Jana Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25450v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

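Scoring rationale: The paper's core topic is label-free correctness detection for language models, which relates directly to keywords such as LLMs (safe deployment of large models), Hallucination Mitigation, and Self-Correction. It proposes Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME), which are model assessment and error detection techniques, and has some connection to Chain of Thought reasoning and Explainable AI. Other keywords such as MoE, quantization, and RAG are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies how to detect language model errors without ground-truth labels. It proposes a training-free approach based on cross-model disagreement (CMP and CME) that outperforms within-model uncertainty baselines across several benchmarks and can be used for deployment monitoring and model routing.

Abstract (Chinese translation)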

在缺乏真实标签的情况下检测语言模型何时出错，是其安全部署面临的根本性挑战。现有方法依赖于模型自身的不确定性——例如词元熵或置信度分数——但这些信号在最危险的失效模式上存在严重缺陷：即自信错误，即模型出错却表现得高度确信。本研究引入跨模型分歧作为正确性指标——这是一种简单、无需训练的信号，可直接嵌入现有生产系统、路由流水线和部署监控基础设施中，无需任何修改。给定一个模型生成的答案，跨模型分歧通过单次前向传播，计算第二个验证模型在读取该答案时的“惊讶”或不确定程度。该方法无需验证模型生成新内容，也无需任何正确性标签。我们将这一原理具体化为跨模型困惑度（Cross-Model Perplexity, CMP）和跨模型熵（Cross-Model Entropy, CME）：CMP衡量验证模型对生成模型所输出答案词元的惊讶程度，CME则衡量验证模型在这些位置上的不确定性。在涵盖推理、检索和数学问题求解（MMLU、TriviaQA和GSM8K）的多个基准测试中，CMP和CME均优于基于模型内部不确定性的基线方法。在MMLU上，CMP实现了0.75的平均AUROC，而模型内部熵基线的结果仅为0.59。这些结果表明，跨模型分歧是一种实用、无需训练的无标签正确性估计方法，可直接应用于部署监控、模型路由、选择性预测、数据过滤以及生产级语言模型系统的可扩展监督中。

Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model’s own uncertainty – such as token entropy or confidence scores – but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator – a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model’s generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model’s surprise at the generating model’s answer tokens, and Cross-Model Entropy (CME), which measures the verifying model’s uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.

Keywords: cross-model disagreement, label-free correctness detection, language model errors, confident errors, deployment monitoring, model routing, perplexity, entropy
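
A minimal sketch of the two signals, assuming the verifier's next-token distributions at the answer positions are already available from a single forward pass. CMP is computed here as ordinary perplexity of the generator's answer tokens under the verifier, and CME as the mean Shannon entropy of the verifier's distributions at those positions. These are the textbook definitions and may differ in detail (normalization, log base) from the paper's; the dictionary-based input format and the toy probabilities are assumptions.

```python
import math


def cross_model_perplexity(verifier_dists, answer_tokens) -> float:
    """exp of the mean negative log-probability the verifier gives each answer token."""
    nll = [-math.log(dist[tok]) for dist, tok in zip(verifier_dists, answer_tokens)]
    return math.exp(sum(nll) / len(nll))


def cross_model_entropy(verifier_dists) -> float:
    """Mean Shannon entropy (nats) of the verifier's distribution at each position."""
    ents = [-sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in verifier_dists]
    return sum(ents) / len(ents)


if __name__ == "__main__":
    # Toy verifier distribution at one answer position (assumption).
    dists = [{"Paris": 0.9, "Lyon": 0.1}]
    print("CMP if generator said Paris:", cross_model_perplexity(dists, ["Paris"]))
    print("CMP if generator said Lyon: ", cross_model_perplexity(dists, ["Lyon"]))
    print("CME:", cross_model_entropy(dists))
```

A high CMP flags an answer the verifier finds surprising even when the generating model itself was confident, which is the confident-error case that within-model entropy misses.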

57. ❌ From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

Authors: Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25423v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

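Scoring rationale: The paper studies micro-video misinformation detection and proposes the WildFakeBench benchmark and the FakeAgent multi-agent reasoning framework. Relevance to the keywords: 1) keywords on LLM internals (MoE, Scaling Laws, training methods, and so on) are entirely unrelated, since the paper does not touch underlying large-model techniques; 2) RAG, CoT, and System 2 Thinking are moderately related (5 points), because FakeAgent involves external evidence retrieval and multi-step reasoning; 3) LLM Agents, Multi-agent Systems, Hallucination Mitigation, and Explainable AI are highly related (8 points), because the paper's core is a multi-agent framework for misinformation detection and explainability analysis; 4) the remaining keywords, such as AI for Science, are entirely unrelated.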
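The relationship between the table columns above and the passing threshold can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the per-keyword rule (weight × relevance, capped at the per-keyword maximum of 15) is an assumption, since the report does not spell it out, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    # Threshold formula stated in the report configuration.
    return base + factor * total_weight

def paper_score(rows, max_per_keyword=15.0):
    # rows: iterable of (weight, relevance) pairs, one per keyword.
    # ASSUMPTION: each keyword contributes weight * relevance,
    # capped at the configured per-keyword maximum.
    return sum(min(weight * relevance, max_per_keyword)
               for weight, relevance in rows)

threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0
```

Under this reading, a table whose weight column is 0.0 for every keyword yields a total of 0.0 regardless of the relevance values.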

!!! tip deepseek-chat TL;DR

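The paper addresses the lack of fine-grained attribution and diversity coverage in micro-video misinformation detection. It proposes WildFakeBench, a benchmark of more than 10,000 real-world cases, and FakeAgent, a multi-agent reasoning framework; experiments show that FakeAgent outperforms existing MLLM methods on every misinformation type.

Abstract (Chinese translation)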

微视频的兴起重塑了虚假信息的传播方式，显著提升了其传播速度、覆盖范围及对公众信任的影响。现有基准测试通常聚焦单一欺骗类型，忽视了现实案例中涉及多模态操纵、AI生成内容、认知偏差和脱离语境重复使用等多样化特征。同时，大多数检测模型缺乏细粒度归因分析能力，限制了可解释性与实际应用价值。为弥补这些不足，我们提出了WildFakeBench——一个包含超过1万个现实世界微视频的大规模基准数据集，涵盖多样化的虚假信息类型与来源，每个样本均标注有专家定义的归因标签。在此基础上，我们开发了FakeAgent，这是一个受德尔菲法启发的多智能体推理框架，通过整合多模态理解与外部证据，实现基于归因的深度分析。FakeAgent协同分析内容与检索证据，以识别操纵痕迹、认知偏差与AI生成模式，并检测脱离语境的虚假信息。大量实验表明，FakeAgent在所有虚假信息类型检测中均持续优于现有多模态大语言模型（MLLMs），而WildFakeBench则为推进可解释的微视频虚假信息检测提供了真实且具有挑战性的测试平台。数据与代码公开于：https://github.com/Aiyistan/FakeAgent。

Abstract

The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse. Meanwhile, most detection models lack fine-grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large-scale benchmark of over 10,000 real-world micro-videos covering diverse misinformation types and sources, each annotated with expert-defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi-inspired multi-agent reasoning framework that integrates multimodal understanding with external evidence for attribution-grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI-generated patterns, and detect out-of-context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro-video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.

Keywords: micro-video misinformation, multimodal manipulation, multi-agent reasoning, attribution analysis, AI-generated content, explainable detection, WildFakeBench, FakeAgent

Authors: Roman Kueble, Marco Hueller, Mrunmai Phatak, Rainer Lienhart, Joerg Haehner Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

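Scoring rationale: The paper studies reinforcement-learning navigation for embodied semantic scene graph generation, which belongs to robotics and embodied AI, and is unrelated to most keywords (which mainly concern large-model internals, training methods, and inference optimization). Only 'World Models AND General World Models' is somewhat related (5 points), because the paper discusses semantic world models for embodied agents, though not general world models. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper studies how improved reinforcement-learning policy optimization and action representations can raise the completeness and efficiency of semantic scene graphs built by an embodied agent under a limited action budget. The results show that a modern optimization algorithm and a finer-grained, factorised action representation improve performance markedly.

Abstract (Chinese translation)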

语义世界模型使具身智能体能够超越纯几何表征，对物体、关系及空间语境进行推理。在有机计算领域，此类模型是在不确定性与资源约束下实现目标驱动自适应能力的关键赋能技术。其核心挑战在于如何在有限行动预算内，获取能最大化模型质量与下游应用价值的观测信息。语义场景图为实现这一目标提供了结构化且紧凑的表征形式。然而，在有限行动步长内构建语义场景图需要探索策略，以权衡信息增益与导航成本，并判断何时继续行动将产生收益递减效应。本研究提出了一种用于具身语义场景图生成的模块化导航组件，通过替换策略优化方法并重构离散行动表述，实现了决策机制的现代化。我们研究了紧凑型与更细粒度的大型离散动作集合，并比较了基于原子动作的单头策略与基于动作组件的分解式多头策略。我们评估了课程学习与可选的基于深度的碰撞监督方法，并对语义场景图的完整性、执行安全性及导航行为进行了量化分析。实验结果表明：在相同奖励塑形条件下，仅替换优化算法即可使语义场景图完整性相对基线提升21%。深度信息主要影响执行安全性（无碰撞运动），而对完整性影响甚微。将现代优化算法与细粒度分解式动作表征相结合，能够实现最优的整体完整性-效率权衡。

Abstract

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness–efficiency trade-off.

Keywords: Embodied AI, Semantic Scene Graph Generation, Reinforcement Learning Navigation, Policy Optimization, Action Representation, Curriculum Learning, Collision Avoidance, Information Gain
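The contrast between a single-head policy over atomic actions and a factorised multi-head policy can be illustrated with a small sketch. It is not the paper's architecture: the component sizes (for example 8 headings × 3 step lengths × 3 camera tilts) are hypothetical, and the heads are assumed to be independent categorical distributions, so the joint log-probability is the sum of the per-head log-probabilities.

```python
import numpy as np

def softmax(logits):
    # Stable softmax for a 1-D logit vector.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def factorised_log_prob(head_logits, action):
    # head_logits: one 1-D logit array per action component.
    # action: tuple with one chosen index per component.
    # Independent heads: joint log-prob is the sum over components.
    return sum(float(np.log(softmax(logits)[index]))
               for logits, index in zip(head_logits, action))

def num_outputs(component_sizes):
    # A single head needs one logit per atomic (joint) action;
    # a factorised policy needs one logit per component value.
    single_head = int(np.prod(component_sizes))
    factorised = int(np.sum(component_sizes))
    return single_head, factorised
```

With the hypothetical sizes above, a single head would need 72 outputs while the factorised policy needs 14, which is one reason a finer-grained action set stays tractable when it is factorised.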

59. ❌ Decidable By Construction: Design-Time Verification for Trustworthy AI

Authors: Houston Haynes Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

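Scoring rationale: The paper focuses on a design-time verification framework for AI models, covering type systems, algebraic structure, computational correctness, and reliability, but it does not address large models, deep-learning internals, or scientific applications. All keywords concern large-model techniques, training methods, inference optimization, alignment, or application domains, whereas this paper belongs to formal verification and theoretical computer science, so it has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a framework that verifies numerical stability, computational correctness, and physical consistency at the design stage of an AI model rather than after training. It does so through decidable constraints over finitely generated abelian groups, a type system, and a program hypergraph, which removes the computational overhead of conventional reliability methods.

Abstract (Chinese translation)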

机器学习领域一个普遍假设是：模型的正确性必须在事后强制保证。我们观察到，决定AI模型是否数值稳定、计算正确或符合物理领域约束的特性，并不必然需要事后验证。这些特性可以在设计阶段、训练开始前以边际计算成本进行验证，这对于部署于高影响力决策支持和科学约束环境中的模型尤为重要。这些特性共享一种特定的代数结构：它们可表达为有限生成阿贝尔群 $\mathbb{Z}^n$ 上的约束，其中推理在多项式时间内可判定且主类型唯一。基于这一观察构建的框架融合了三项已有研究成果（arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104）：一个在模型细化过程中携带任意注解作为持久化共数据的维度类型系统；一个仅从类型签名即可推断克利福德代数阶数并推导几何乘积稀疏性的程序超图；以及一种通过前向模态共效应分析和精确posit累加在训练过程中保持上述不变量的自适应领域模型架构。我们认为这种融合产生了一个新颖的信息论结果：阿贝尔群上的Hindley-Milner统一化在Solomonoff通用先验的可计算限制下计算了最大后验假设，使得该框架的类型推断与通用归纳建立在相同的形式化基础上。我们比较了四种当代AI可靠性方法，并证明每种方法都会产生可能跨部署、层级和推理请求累积的系统开销。该框架通过构造消除了这种开销。

Abstract

A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff’s universal prior, placing the framework’s type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.

Keywords: design-time verification, trustworthy AI, numerical stability, computational correctness, abelian groups, type system, program hypergraph, reliability framework
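The idea that physical-consistency constraints live in $\mathbb{Z}^n$ can be made concrete with ordinary dimensional analysis. The sketch below is a generic illustration, not the paper's type system: it represents a dimension as an integer exponent vector over three assumed base dimensions (length, mass, time), so multiplication is vector addition and a sum is well-typed only when both vectors are equal. The class and function names are hypothetical, and no Hindley-Milner inference is implemented.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Dim:
    # Exponents over the assumed base dimensions (length, mass, time).
    exps: Tuple[int, ...]

    def __mul__(self, other):
        # Multiplying quantities adds exponent vectors (group operation).
        return Dim(tuple(a + b for a, b in zip(self.exps, other.exps)))

    def __truediv__(self, other):
        # Dividing quantities subtracts exponent vectors (group inverse).
        return Dim(tuple(a - b for a, b in zip(self.exps, other.exps)))

def check_add(left, right):
    # Addition is only defined for identical dimensions.
    if left != right:
        raise TypeError(f"cannot add {left.exps} and {right.exps}")
    return left

LENGTH = Dim((1, 0, 0))
MASS = Dim((0, 1, 0))
TIME = Dim((0, 0, 1))

velocity = LENGTH / TIME
force = MASS * LENGTH / (TIME * TIME)
```

Because every check reduces to integer-vector equality, such constraints can be resolved before any training run starts, which is the design-time property the abstract emphasizes.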

60. ❌ Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Authors: Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li, Ruixuan Huang, Zhenlan Ji, Pingchuan Ma, Shuai Wang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25412v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

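Scoring rationale: The paper's core topic is the reasoning safety of LLMs, so it is highly related to 'Large Language Models' and 'Chain of Thought Reasoning' (10 points), which bear directly on the reasoning process. 'System 2 Thinking', 'Self-Correction', 'Hallucination Mitigation', and 'Mechanistic Interpretability' are somewhat related (5 points), because the paper discusses deep reasoning, error detection and correction, factuality, and interpretability. The other keywords are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces reasoning safety as a new concept for large language models, defines nine categories of unsafe reasoning behavior, and builds a real-time monitor that detects and classifies reasoning errors. On the benchmark it reaches up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy.

Abstract (Chinese translation)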

大型语言模型（LLM）日益依赖显式的思维链（CoT）推理来解决复杂任务，然而推理过程本身的安全性在很大程度上仍未得到关注。现有关于LLM安全性的研究主要集中于内容安全——检测有害、有偏见或事实错误的输出——并将推理链视为不透明的中间产物。我们提出推理安全性作为一个正交且同等关键的安全维度：即要求模型的推理轨迹在逻辑上一致、计算高效并能抵抗对抗性操纵。我们做出三项贡献。首先，我们正式定义了推理安全性，并提出了一个包含九类不安全推理行为的分类体系，涵盖输入解析错误、推理执行错误和过程管理错误。其次，我们开展了一项大规模普遍性研究，对来自自然推理基准和四种对抗攻击方法（推理劫持和拒绝服务攻击）的4111条推理链进行了标注，确认所有九类错误在实践中均会出现，且每种攻击都会产生一种机制上可解释的特征模式。第三，我们提出了一种推理安全监控器：这是一个基于外部LLM的组件，与目标模型并行运行，通过嵌入分类体系的提示词实时检查每个推理步骤，并在检测到不安全行为时发出中断信号。在一个包含450条推理链的静态基准上的评估表明，我们的监控器实现了高达84.88%的步骤级定位准确率和85.37%的错误类型分类准确率，显著优于幻觉检测器和过程奖励模型基线方法。这些结果表明，推理层面的监控既是必要的，也是实际可行的，并将推理安全性确立为大型推理模型安全部署的一个基础性问题。

Abstract

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety – detecting harmful, biased, or factually incorrect outputs – and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model’s reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.

Keywords: Large Language Models, Chain-of-Thought Reasoning, Reasoning Safety, Adversarial Attacks, Real-time Monitoring, Mechanistic Interpretability, Hallucination Detection, Security Deployment
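The step-level monitoring loop described in the abstract can be sketched as a function that inspects reasoning steps one at a time and stops at the first unsafe one. This is a sequential simplification under stated assumptions: the paper's monitor is an external LLM running in parallel with the target model, whereas `toy_checker` below is a hypothetical keyword rule standing in for that LLM call, and the only label used is one of the three error groups named in the abstract.

```python
from typing import Callable, Iterable, Optional, Tuple

# A verdict is None for a safe step, or an error label for an unsafe one.
Verdict = Optional[str]

def monitor(steps: Iterable[str],
            check_step: Callable[[str], Verdict]) -> Optional[Tuple[int, str]]:
    # Inspect each step as it arrives; return (step index, label) at the
    # first unsafe step, which plays the role of the interrupt signal.
    for index, step in enumerate(steps):
        label = check_step(step)
        if label is not None:
            return index, label
    return None  # the whole chain passed

def toy_checker(step: str) -> Verdict:
    # HYPOTHETICAL stand-in: a real monitor would query an external LLM
    # with a taxonomy-embedded prompt instead of matching a phrase.
    if "ignore the question" in step.lower():
        return "input parsing error"
    return None
```

Returning the step index alongside the label mirrors the two quantities the paper evaluates: step-level localization and error-type classification.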

61. ❌ System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games

Authors: Guangyu Zhao, Ceyao Zhang, Chengdong Ma, Tao Wu, Yiyang Song, Haoxuan Ru, Yifan Zhong, Ruilin Yan, Lingfeng Li, Ruochong Li, Yu Li, Xuyuan Han, Yun Ding, Ruizhang Jiang, Xiaochuan Zhang, Yichao Li, Yuanpei Chen, Yaodong Yang, Yitao Liang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

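Scoring rationale: The paper focuses on robotic system design and studies how to maintain internal state consistency in long-horizon tabletop games, covering perception, execution, interaction-state management, system architecture, and recovery mechanisms. All scoring keywords relate directly to large models, deep-learning internals, or AI applications in science, and the paper mentions none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how system design can maintain internal state consistency in long-horizon robotic tabletop games and proposes an integrated architecture whose state partitioning, monitoring, and recovery mechanisms markedly improve end-to-end reliability.

Abstract (Chinese translation)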

长时程桌面游戏对机器人系统提出了独特的挑战：微小的感知或执行误差可能导致累积任务状态失效，误差在决策模块间传播，并最终破坏交互过程。本文研究如何通过系统性设计（而非孤立组件改进）在回合制多人机器人桌面游戏中维持内部状态一致性。以麻将作为代表性长时程场景，我们提出一种集成架构：该架构显式维护感知、执行与交互状态，将高层语义推理与时间敏感的感知控制相分离，并整合了经验证的动作基元与触觉触发恢复机制，以防止状态过早损坏。我们进一步引入交互层监控机制，用于检测违反回合规则及破坏执行假设的隐藏信息泄露行为。除展示完整游戏运行外，我们通过实证分析了部署过程中观察到的故障模式、恢复效能、跨模块错误传播以及硬件-算法权衡。结果表明：显式模块划分、受监控的状态转换及恢复机制对于维持长时程游戏中的可执行一致性至关重要，而单体化或未经验证的流水线会导致端到端可靠性的显著下降。本系统为研究长时程回合制交互中的系统级设计原则提供了实证平台。

Abstract

Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated component improvement. Using Mahjong as a representative long-horizon setting, we present an integrated architecture that explicitly maintains perceptual, execution, and interaction state, partitions high-level semantic reasoning from time-critical perception and control, and incorporates verified action primitives with tactile-triggered recovery to prevent premature state corruption. We further introduce interaction-level monitoring mechanisms to detect turn violations and hidden-information breaches that threaten execution assumptions. Beyond demonstrating complete-game operation, we provide an empirical characterization of failure modes, recovery effectiveness, cross-module error propagation, and hardware-algorithm trade-offs observed during deployment. Our results show that explicit partitioning, monitored state transitions, and recovery mechanisms are critical for sustaining executable consistency over extended play, whereas monolithic or unverified pipelines lead to measurable degradation in end-to-end reliability. The proposed system serves as an empirical platform for studying system-level design principles in long-horizon, turn-based interaction.

Keywords: robotic tabletop games, internal state consistency, system design, long-horizon interaction, perceptual execution interaction state, monitored state transitions, recovery mechanisms, end-to-end reliability

62. ❌ Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Authors: Eyal Hadad, Mordechai Guri Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

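Scoring rationale: The paper studies security vulnerabilities in local vision-language models (VLMs). It is highly related to 'Small Language Models OR SLMs OR On-device AI' (10 points) because it explicitly targets on-device VLMs. It is somewhat related to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because VLMs fall under large models, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because the paper mentions scientific application scenarios such as medical X-rays. The other keywords (MoE, Scaling Laws, training methods, inference optimization, agent systems, and so on) are entirely unrelated to the paper's side-channel attack research and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper reveals an algorithmic side channel introduced by dynamic high-resolution preprocessing in local vision-language models. A dual-layer attack framework, exploiting execution-time variation and cache contention, can infer the geometry and semantic content of input images and thereby threatens privacy-sensitive data.

Abstract (Chinese translation)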

设备端视觉语言模型（VLMs）通过本地执行承诺保障数据隐私。然而，我们发现，向动态高分辨率预处理（例如AnyRes）的架构转变引入了一种固有的算法侧信道。与静态模型不同，动态预处理会根据图像的长宽比将其分解为数量可变的图像块，从而创建与工作负载相关的输入。我们提出了一种针对本地VLMs的双层攻击框架。在第一层中，未授权攻击者可以利用标准无权限操作系统指标，通过显著的执行时间差异可靠地推断输入图像的几何尺寸。在第二层中，攻击者通过剖析末级缓存（LLC）争用情况，能够解析相同几何尺寸下的语义模糊性，从而区分视觉密集内容（如医学X光片）与稀疏内容（如文本文档）。通过评估LLaVA-NeXT和Qwen2-VL等先进模型，我们证明结合这些信号能够可靠地推断出涉及隐私的敏感上下文。最后，我们分析了缓解此漏洞的安全工程权衡，揭示了采用恒定工作量填充方案所带来的显著性能开销，并为安全的边缘人工智能部署提出了实用的设计建议。

Abstract

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input’s geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

Keywords: Vision-Language Models, On-device AI, Side-channel Attacks, Dynamic High-Resolution Preprocessing, Local Execution, Privacy Security, Edge AI, Cache Contention
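Why aspect ratio leaks through workload can be seen in a toy model of dynamic tiling. The sketch below is a simplification under stated assumptions, not the AnyRes algorithm of any specific model: the candidate grid list is hypothetical, and the selection rule (closest aspect ratio in log space, ties broken by fewer tiles) stands in for whatever resolution-matching rule a real preprocessor uses.

```python
import math

# HYPOTHETICAL candidate tilings as (rows, cols).
CANDIDATE_GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]

def pick_grid(width, height, grids=CANDIDATE_GRIDS):
    # Choose the grid whose aspect ratio (cols / rows) is closest to the
    # image's aspect ratio (width / height), preferring fewer tiles on ties.
    target = math.log(width / height)
    return min(
        grids,
        key=lambda g: (abs(math.log(g[1] / g[0]) - target), g[0] * g[1]),
    )

def tile_count(width, height):
    # The number of tiles fed to the vision encoder; encoder work, and
    # therefore execution time, grows with this count.
    rows, cols = pick_grid(width, height)
    return rows * cols
```

In this toy model a square image maps to one tile, a 1:2 portrait to two, and a 3:1 panorama to three, so an observer who can time the encoder learns something about the input's geometry without ever seeing the image.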

63. ❌ A Causal Framework for Evaluating ICU Discharge Strategies

Authors: Sagar Nagaraj Simha, Juliette Ortholand, Dave Dongelmans, Jessica D. Workum, Olivier W. M. Thijssens, Ameen Abu-Hanna, Giovanni Cinà Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

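Scoring rationale: The paper focuses on causal inference for medical decision-making, using the g-formula framework and the MIMIC-IV dataset to evaluate ICU discharge strategies. Its content is entirely unrelated to most keywords (large-model techniques, training methods, inference optimization, and so on). Only 'AI for Science OR Bioinformatics OR Cheminformatics' is somewhat related (5 points), because the paper applies AI and statistical methods to a medical science problem, although it involves no specific bioinformatics or cheminformatics techniques.

!!! tip deepseek-chat TL;DR

The paper proposes a causal-inference framework that uses the g-formula and the MIMIC-IV dataset to evaluate ICU discharge strategies, with the aim of optimizing intervention length and patient outcomes.

Abstract (Chinese translation)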

在本应用研究中，我们探讨了重症监护病房患者出院时机选择这一具有挑战性的开放性问题。该问题可被构建为一个最优停止问题，并面临三重额外挑战：1）基于观测数据评估停止策略本身即是一个复杂的因果推断问题；2）复合目标需同时最小化干预时长与最大化治疗结果，但二者无法简化为单一维度；3）干预终止时变量记录即同步停止。我们的贡献包含两个方面。首先，我们拓展了g-formula Python工具包的实现框架，为具有上述结构的问题（包括正性检验与覆盖性检验）提供了停止策略评估体系。其次，通过全开源分析流程，我们将该方法应用于公共ICU数据集MIMIC-IV，证明了改进现有临床策略的潜在可能性。

Abstract

In this applied paper, we address the difficult open problem of when to discharge patients from the Intensive Care Unit. This can be conceived as an optimal stopping scenario with three added challenges: 1) the evaluation of a stopping strategy from observational data is itself a complex causal inference problem, 2) the composite objective is to minimize the length of intervention and maximize the outcome, but the two cannot be collapsed to a single dimension, and 3) the recording of variables stops when the intervention is discontinued. Our contributions are two-fold. First, we generalize the implementation of the g-formula Python package, providing a framework to evaluate stopping strategies for problems with the aforementioned structure, including positivity and coverage checks. Second, with a fully open-source pipeline, we apply this approach to MIMIC-IV, a public ICU dataset, demonstrating the potential for strategies that improve upon current care.

Keywords: ICU discharge, causal inference, optimal stopping, g-formula, MIMIC-IV, observational data, healthcare decision-making, evaluation framework
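Because the abstract says the two objectives (shorter intervention, better outcome) cannot be collapsed into one dimension, one natural way to compare strategies is by Pareto dominance. The sketch below is an illustration of that idea only, not the paper's evaluation pipeline; the function name is hypothetical, and each strategy is assumed to be summarized by a pair (length of stay, outcome) already estimated elsewhere.

```python
def pareto_front(strategies):
    # strategies: dict mapping a strategy name to (length_of_stay, outcome),
    # where lower length_of_stay and higher outcome are both preferred.
    front = []
    for name, (los, outcome) in strategies.items():
        dominated = any(
            # Another strategy is at least as good on both objectives
            # and strictly better on at least one.
            (other_los <= los and other_outcome >= outcome)
            and (other_los < los or other_outcome > outcome)
            for other, (other_los, other_outcome) in strategies.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)
```

A strategy on the front is one that no alternative beats on both axes at once; choosing among front members still requires a clinical judgment about the trade-off.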

64. ❌ GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Authors: Selim An, Il hong Suh, Yeseong Kim Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

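Scoring rationale: GlowQ is a technique for quantized large language models (LLMs), so it is highly relevant to "Large Language Models" and "Quantization" (10 points each): its core aim is to recover the accuracy lost when LLMs are quantized. It is partly relevant to "Speculative Decoding OR Inference Acceleration" (5 points), because the paper reports lower latency (TTFB) and higher throughput, which falls under inference acceleration. The remaining keywords, such as MoE, SLMs, Scaling Laws, Pre-training, and Alignment, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GlowQ, a group-shared low-rank approximation method for quantized large language models. By caching a shared factor and restoring layers selectively, it reduces latency and raises throughput while preserving model accuracy.

Abstract translation (Chinese)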

比特量化技术（如BitsAndBytes、AWQ和GPTQ）作为部署大语言模型的标准方法被广泛使用，但在使用低比特表示（如4比特）时往往会导致精度下降。低秩校正方法（例如LQER、QERA、ASER）已被提出以缓解此问题，然而这些方法会恢复所有层并在每个解码器块中插入误差校正模块，从而增加延迟和内存开销。为应对这一局限，我们提出GlowQ——一种面向量化大语言模型的组共享低秩近似方法，该方法在每个输入共享组中缓存一个共享右因子，并仅恢复那些能带来最高精度收益的组或层。GlowQ为每个输入共享组计算一次高精度投影，并在其所有模块中重复使用，从而减少参数和内存开销，同时保留逐层校正的表达能力。我们还提出一种选择性变体GlowQ-S，其仅在能带来最大收益的位置应用缓存的共享模块。与强基线方法相比，我们的方法平均降低首次字节时间（TTFB）5.6%、提升吞吐量9.6%，同时在WikiText-2数据集上降低困惑度（perplexity）0.17%并提升下游任务准确率0.42个百分点。选择性模型GlowQ-S进一步降低了延迟，将首次字节时间减少23.4%、吞吐量提升37.4%，同时将平均准确率损失控制在0.2个百分点以内。

Abstract

Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method for deploying large language models but often degrade accuracy when using low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across its modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.

Keywords: Quantization, Large Language Models, Low-rank Approximation, Model Compression, Inference Acceleration, Parameter Efficiency, Accuracy Restoration, Group-Shared Modules
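
The idea of one shared right factor per input-sharing group can be illustrated with a small NumPy sketch. This is not the GlowQ implementation: the round-to-nearest quantizer, the SVD of the stacked quantization errors, and all function names are assumptions made for illustration. The point it shows is that the projection `b @ x` is computed once and reused by every module in the group.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization, one scale per output row (assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def shared_low_rank_correction(weights, rank, bits=4):
    """Fit one right factor shared by all layers that read the same input."""
    quantized = [quantize_rtn(w, bits) for w in weights]
    errors = [w - q for w, q in zip(weights, quantized)]
    # Stack the per-module errors so one SVD yields a common row space.
    _, _, vt = np.linalg.svd(np.vstack(errors), full_matrices=False)
    b = vt[:rank]                        # shared right factor, shape (rank, d_in)
    lefts = [e @ b.T for e in errors]    # per-module left factors, shape (d_out, rank)
    return quantized, lefts, b

def forward_group(x, quantized, lefts, b):
    """Apply every module in the group, computing the shared projection once."""
    z = b @ x
    return [q @ x + a @ z for q, a in zip(quantized, lefts)]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 64)) for _ in range(3)]  # e.g. three projections of one input
quantized, lefts, b = shared_low_rank_correction(weights, rank=8)
outputs = forward_group(rng.standard_normal(64), quantized, lefts, b)
```

Because the rows of `b` are orthonormal, `e @ b.T` is the least-squares left factor for each module given the shared right factor, so the correction can only shrink the reconstruction error.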

65. ❌ Does Structured Intent Representation Generalize? A Cross-Language, Cross-Model Empirical Study of 5W3H Prompting

Author: Peng Gang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25379v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

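Scoring rationale: The paper studies how well the 5W3H structured prompting framework generalizes across languages and models. Its core concerns are prompt engineering and intent alignment for LLMs, so it is highly relevant to "Large Language Models" (10 points) and fairly relevant to "Instruction Tuning OR Alignment OR Value Alignment" (8 points), because it examines how structured prompts improve intent alignment. The remaining keywords, such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, compression, and AI-for-science applications, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the 5W3H structured prompting framework generalizes across languages and models. It finds that AI-expanded prompts do not differ significantly from manually crafted prompts in goal alignment, and that structured prompts often reduce cross-model output variance, though not uniformly across languages and metrics. AI-assisted authoring thereby lowers the barrier for non-expert users.

Abstract translation (Chinese)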

结构化意图表征能否跨语言与模型泛化？本研究以人机交互中基于5W3H框架的结构化意图表征方法——提示协议规范（PPS，Prompt Protocol Specification）为对象，在先前仅针对中文证据的基础上沿三个维度进行拓展：新增两种语言（英语与日语）、引入第四种实验条件（用户通过AI辅助创作界面将简单提示自动扩展为完整5W3H规范），并提出了关于跨模型输出一致性的新研究问题。通过对2,160组模型输出（3种语言×4种条件×3个大语言模型×60项任务）的分析，我们发现：在所有三种语言中，AI扩展生成的5W3H提示（条件D）与人工构建的5W3H提示（条件C）在目标对齐度上无统计学显著差异，而前者仅需用户输入单句提示。结构化PPS条件常能降低或重塑跨模型输出方差，但该效果在不同语言与度量指标间并不均匀；最强证据来自对无约束基线中虚假低方差的识别。研究还表明，非结构化提示存在系统性双重膨胀偏差：复合分数被人为抬高，而跨模型方差表象被人为压低。这些发现证明，结构化5W3H表征能提升跨语言与跨模型的意图对齐度和可及性，尤其在AI辅助创作降低非专业用户使用门槛时效果更为显著。

Abstract

Does structured intent representation generalize across languages and models? We study PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction, and extend prior Chinese-only evidence along three dimensions: two additional languages (English and Japanese), a fourth condition in which a user’s simple prompt is automatically expanded into a full 5W3H specification by an AI-assisted authoring interface, and a new research question on cross-model output consistency. Across 2,160 model outputs (3 languages x 4 conditions x 3 LLMs x 60 tasks), we find that AI-expanded 5W3H prompts (Condition D) show no statistically significant difference in goal alignment from manually crafted 5W3H prompts (Condition C) across all three languages, while requiring only a single-sentence input from the user. Structured PPS conditions often reduce or reshape cross-model output variance, though this effect is not uniform across languages and metrics; the strongest evidence comes from identifying spurious low variance in unconstrained baselines. We also show that unstructured prompts exhibit a systematic dual-inflation bias: artificially high composite scores and artificially low apparent cross-model variance. These findings suggest that structured 5W3H representations can improve intent alignment and accessibility across languages and models, especially when AI-assisted authoring lowers the barrier for non-expert users.

Keywords: structured intent representation, 5W3H prompting, cross-language generalization, cross-model consistency, AI-assisted authoring, intent alignment, output variance, human-AI interaction
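
The size of the factorial design quoted in the abstract (3 languages × 4 conditions × 3 LLMs × 60 tasks) can be checked by enumerating the cells. The sketch below is only a bookkeeping aid: the abstract names Conditions C and D, while the labels "A" and "B" and the model names are placeholders assumed here.

```python
from itertools import product

languages = ["Chinese", "English", "Japanese"]
conditions = ["A", "B", "C", "D"]     # C = manual 5W3H, D = AI-expanded 5W3H; A and B are assumed labels
models = ["llm_1", "llm_2", "llm_3"]  # placeholder names; the abstract does not list the models
tasks = range(1, 61)

# One model output per (language, condition, model, task) cell.
cells = list(product(languages, conditions, models, tasks))
outputs_per_condition = len(cells) // len(conditions)
print(len(cells), outputs_per_condition)  # prints: 2160 540
```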

66. ❌ Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics

Authors: João Castelo-Branco, José Santos-Victor, Alexandre Bernardino Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25366v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

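Scoring rationale: The paper addresses indoor object search for mobile robots with a hybrid method that combines Bayesian inference and deep reinforcement learning. Although it involves deep learning (deep reinforcement learning), every keyword targets large language model (LLM) techniques, training methods, optimization techniques, or specific application areas such as bioinformatics. The paper does not mention language models, the principles behind large models, LLM applications, or AI for Science, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid framework that combines Bayesian inference with deep reinforcement learning for object search by mobile robots in partially observable indoor environments. Experiments show that it raises the success rate while reducing search effort.

Abstract translation (Chinese)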

自主目标搜索对于在室内环境中运行的移动机器人具有挑战性，这源于部分可观测性、感知不确定性以及需要在探索与导航效率之间进行权衡。经典概率方法显式地表示不确定性，但通常依赖于手工设计的动作选择启发式策略；而深度强化学习能够实现自适应策略，却常面临收敛速度慢和可解释性有限的问题。本文提出一种混合目标搜索框架，将贝叶斯推理与深度强化学习相结合。该方法通过校准的目标检测在线进行贝叶斯推理更新，维持一个关于目标位置的空间置信度地图，并训练一个强化学习策略直接从该概率表征中选择导航动作。研究使用Habitat 3.0在真实室内仿真环境中对该方法进行评估，并与已开发的基线策略进行比较。在两个室内场景中的实验表明，所提方法在降低搜索成本的同时提高了成功率。总体而言，结果验证了将贝叶斯置信度估计与学习型动作选择相结合的价值，能够在部分可观测条件下实现更高效、更可靠的目标搜索行为。

Abstract

Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable object-search behavior under partial observability.

Keywords: ObjectNav, Mobile Robotics, Deep Reinforcement Learning, Bayesian Inference, Partial Observability, Spatial Belief Map, Habitat 3.0, Indoor Navigation
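
A spatial belief map updated by Bayes' rule, as described in the abstract, might look like the following minimal grid sketch. The observation model is an assumption made for illustration (a fixed detection probability `p_detect` inside the field of view and a fixed false-alarm probability `p_false`); the paper's calibrated detector model is not given in the abstract.

```python
import numpy as np

def update_belief(belief, visible, detected_cell=None, p_detect=0.8, p_false=0.05):
    """One Bayes update of a grid belief over the target's cell.

    belief: 2D array of probabilities summing to 1.
    visible: boolean mask of cells currently in the field of view.
    detected_cell: (row, col) of a detection, or None if nothing was detected.
    """
    if detected_cell is None:
        # No detection: a target in view was missed, a target out of view raised no false alarm.
        likelihood = np.where(visible, 1.0 - p_detect, 1.0 - p_false)
    else:
        # Detection: likely if the target is at that cell, otherwise a false alarm.
        likelihood = np.full(belief.shape, p_false)
        likelihood[detected_cell] = p_detect
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.full((5, 5), 1.0 / 25)   # uniform prior over a 5x5 map
visible = np.zeros((5, 5), dtype=bool)
visible[:2, :2] = True               # the robot currently sees the top-left corner
belief = update_belief(belief, visible)  # nothing seen: mass shifts to unseen cells
```

A learned policy of the kind the abstract describes would take this belief array, rather than raw detections, as its observation.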

67. ❌ 4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles

Author: Yunus E. Zeytuncu Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

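Scoring rationale: The paper models structural difficulty in integer arithmetic puzzles and uses a dynamic-programming solver to analyze difficulty features; it is a machine learning application to a mathematical reasoning task. Every keyword relates directly to large models, the principles of deep learning techniques, or AI-for-science applications. This paper involves none of these and uses only conventional machine learning models (as baselines) to analyze structured data, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper studies structural difficulty modeling in integer arithmetic puzzles. It builds a large-scale dataset with an exact dynamic-programming solver and finds that difficulty is fully determined by a small set of interpretable structural attributes extracted from minimal constructions, giving adaptive arithmetic learning systems a transparent, computable framework for difficulty estimation.

Abstract translation (Chinese)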

算术谜题为研究数学推理任务中的难度提供了一个受控环境，这是自适应学习系统的核心挑战。我们以数字游戏为灵感，探究一类整数算术谜题中难度的结构性决定因素。我们对该问题进行了形式化描述，并开发了一种精确的动态规划求解器，该求解器能够枚举可达目标值、提取最少操作步骤的证明路径，并支持大规模数据标注。
利用此求解器，我们构建了一个包含超过340万个实例的数据集，并通过达到目标所需的最少操作数来定义难度。我们分析了难度与求解器所导出特征之间的关系。虽然基于谜题整体和目标值层面统计特征的基线机器学习模型能够部分预测可解性，但它们无法可靠地区分简单实例。相比之下，我们证明难度完全由一小部分从精确证明路径中推导出的、可解释的结构属性所决定。具体而言，在最少步骤构造中所使用的输入数值数量，在此标注方案下，构成了难度的最小充分统计量。
这些结果为谜题难度提供了一个透明的、基于计算原理的解释，从而在符号推理与数据驱动建模之间架起了桥梁。该框架支持可解释的难度评估和原则性的任务排序，对自适应算术学习和智能练习系统具有直接意义。

Abstract

Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. We investigate the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. We formalize the problem and develop an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. Using this solver, we construct a dataset of over 3.4 million instances and define difficulty via the minimum number of operations required to reach a target. We analyze the relationship between difficulty and solver-derived features. While baseline machine learning models based on bag- and target-level statistics can partially predict solvability, they fail to reliably distinguish easy instances. In contrast, we show that difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses. In particular, the number of input values used in a minimal construction serves as a minimal sufficient statistic for difficulty under this labeling. These results provide a transparent, computationally grounded account of puzzle difficulty that bridges symbolic reasoning and data-driven modeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.

Keywords: integer arithmetic puzzles, structural difficulty modeling, dynamic-programming solver, minimal-operation witnesses, adaptive learning systems, explainable difficulty estimation, task sequencing, mathematical reasoning
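
An exact solver of the kind the abstract describes can be sketched as a dynamic program over subsets of the input numbers. The puzzle rules are assumptions here, since the abstract does not state them: each input is a positive integer used at most once, the operations are +, −, ×, ÷, and intermediate results must stay positive integers. Under these rules an expression that uses k inputs always takes k − 1 operations, which is one way to see why the number of inputs used in a minimal construction tracks difficulty.

```python
def reachable_by_subset(numbers):
    """Map each bitmask of inputs to the set of values that use exactly those inputs."""
    n = len(numbers)
    reach = {1 << i: {numbers[i]} for i in range(n)}
    for mask in range(1, 1 << n):
        if mask in reach:              # single-number subsets are already filled in
            continue
        values = set()
        sub = (mask - 1) & mask
        while sub:                     # enumerate proper non-empty submasks
            other = mask ^ sub
            if sub < other:            # visit each unordered split once
                for a in reach[sub]:
                    for b in reach[other]:
                        hi, lo = max(a, b), min(a, b)
                        values.add(a + b)
                        values.add(a * b)
                        if hi > lo:
                            values.add(hi - lo)
                        if hi % lo == 0:
                            values.add(hi // lo)
            sub = (sub - 1) & mask
        reach[mask] = values
    return reach

def min_operations(numbers, target):
    """Minimum number of operations needed to reach target, or None if unreachable."""
    reach = reachable_by_subset(numbers)
    counts = [bin(mask).count("1") - 1 for mask, vals in reach.items() if target in vals]
    return min(counts) if counts else None

print(min_operations([1, 2, 3, 4], 24))  # prints: 2  (2 * 3 * 4)
```

Submasks are always numerically smaller than the mask they split, so iterating masks in increasing order guarantees that both halves of every split have already been solved.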

68. ❌ Image Rotation Angle Estimation: Comparing Circular-Aware Methods

Author: Maximilian Woehrer Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

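Scoring rationale: The paper focuses on image rotation angle estimation, a computer vision task, and compares regression methods across conventional convolutional neural network and vision Transformer architectures. It does not involve large language models, new principles in deep learning techniques, or AI applications in any scientific field. The keywords all center on large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, none of which relate to the paper's computer vision regression task.

!!! tip deepseek-chat TL;DR

The paper systematically compares five circular-aware methods for image rotation angle estimation. Probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the highest accuracy on well-matched backbones; the best configuration reaches a mean absolute error of 1.23° on the DRC-D dataset.

Abstract translation (Chinese)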

自动图像旋转估计是许多视觉流程中的关键预处理步骤。该任务具有挑战性，因为角度具有环形拓扑结构，其边界不连续性会阻碍标准回归方法。我们对五种全局方向估计的环形感知方法进行了全面研究：采用环形损失的直接角度回归、通过角度分箱的分类、单位向量回归、相移编码器以及环形高斯分布。利用从ImageNet预训练模型迁移学习，我们通过调整其输出头以适应旋转特定预测，系统评估了这十六种现代架构上的这些方法。我们的结果表明，概率方法（尤其是环形高斯分布）在不同架构间最具鲁棒性，而分类方法在匹配良好的骨干网络上能达到最佳精度，但在其他网络上存在训练不稳定性。在DRC-D数据集上，最佳配置（采用EfficientViT-B3的分类方法）实现了1.23°的平均绝对误差（五次独立运行的平均值），而采用MambaOut Base的环形高斯分布方法达到了几乎相同的1.24°，且在不同骨干网络上具有更强的鲁棒性。在COCO 2014数据集上训练和评估我们的最佳方法-架构组合，最优配置达到了3.71°的平均绝对误差，较先前工作有显著提升，在更大的COCO 2017数据集上进一步改善至2.84°。

Abstract

Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.

Keywords: image rotation estimation, circular-aware methods, transfer learning, circular Gaussian distribution, angular binning, mean absolute error, vision architectures, robustness evaluation
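
The boundary discontinuity mentioned in the abstract is easy to see numerically: a prediction of 359° for a true angle of 1° is off by 2°, not 358°. The NumPy sketch below shows a wrap-around error and the unit-vector encoding that underlies two of the five method families; the function names are illustrative and the paper's exact loss definitions are not given in the abstract.

```python
import numpy as np

def circular_abs_error(pred_deg, true_deg):
    """Absolute angular error in degrees, taking wrap-around at 360 into account."""
    diff = (np.asarray(pred_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0
    return np.abs(diff)

def angle_to_unit_vector(deg):
    """Encode an angle as (cos, sin) so that 0 and 360 map to the same target."""
    rad = np.deg2rad(deg)
    return np.stack([np.cos(rad), np.sin(rad)], axis=-1)

def unit_vector_to_angle(vec):
    """Decode a (cos, sin) pair back to an angle in [0, 360)."""
    vec = np.asarray(vec)
    return np.rad2deg(np.arctan2(vec[..., 1], vec[..., 0])) % 360.0

print(abs(359.0 - 1.0), circular_abs_error(359.0, 1.0))  # prints: 358.0 2.0
```

A mean absolute error such as the 1.23° reported above is only meaningful when computed with a wrap-aware difference of this kind.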

69. ❌ Agentic Trust Coordination for Federated Learning through Adaptive Thresholding and Autonomous Decision Making in Sustainable and Resilient Industrial Networks

Authors: Paul Shepherd, Tasos Dagiuklas, Bugra Alkan, Jonathan Rodriguez Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25334v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

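Scoring rationale: The paper studies trust coordination in federated learning (FL), using an agentic control layer for adaptive thresholding and autonomous decision making to improve the sustainability and resilience of industrial networks. It does not involve large language models (LLMs), the principles of deep learning techniques, AI-for-science applications, or any specific technique among the listed keywords (such as MoE, SFT, RAG, or CoT). The keywords all concern large-model techniques, training methods, inference optimization, AI applications, or scientific domains, whereas this paper focuses on system-level trust management for federated learning, a specific problem in distributed machine learning with no direct connection to the keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight agentic trust coordination approach that uses adaptive thresholding and autonomous decision making to address the reliability problems that inconsistent client behaviour and noisy conditions cause for federated learning in industrial networks. It supports stable FL operation without adding communication overhead.

Abstract translation (Chinese)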

工业网络中的分布式智能日益将异构且资源受限设备间的感知、通信与计算能力相融合。联邦学习（Federated Learning, FL）为此类环境下的协同模型训练提供了可能，但其可靠性受到客户端行为不一致、感知条件存在噪声以及故障或恶意更新等因素的影响。基于信任的机制常被用于缓解这些影响，然而现有方法大多仍停留在统计与启发式层面，依赖于固定参数或简单的自适应规则，难以适应动态变化的工作条件。
本文提出一种面向可持续与韧性工业网络的轻量级智能体化信任协调方法。所提出的智能体化信任控制层（Agentic Trust Control Layer）作为服务器端控制循环运行，它持续观测与信任及系统层面相关的信号，解读其随时间演化的趋势，并在检测到不稳定状态时实施有针对性的信任调整。该方法通过实现情境感知的干预决策，而非依赖固定或纯反应式的参数更新，扩展了先前的自适应信任机制。通过明确分离观测、推理与执行环节，所提框架能够在无需修改客户端训练过程或增加通信开销的前提下，支持联邦学习的稳定运行。

Abstract

Distributed intelligence in industrial networks increasingly integrates sensing, communication, and computation across heterogeneous and resource-constrained devices. Federated learning (FL) enables collaborative model training in such environments, but its reliability is affected by inconsistent client behaviour, noisy sensing conditions, and the presence of faulty or adversarial updates. Trust-based mechanisms are commonly used to mitigate these effects, yet most remain statistical and heuristic, relying on fixed parameters or simple adaptive rules that struggle to accommodate changing operating conditions. This paper presents a lightweight agentic trust coordination approach for FL in sustainable and resilient industrial networks. The proposed Agentic Trust Control Layer operates as a server-side control loop that observes trust-related and system-level signals, interprets their evolution over time, and applies targeted trust adjustments when instability is detected. The approach extends prior adaptive trust mechanisms by enabling context-aware intervention decisions, rather than relying on fixed or purely reactive parameter updates. By explicitly separating observation, reasoning, and action, the proposed framework supports stable FL operation without modifying client-side training or increasing communication overhead.

Keywords: Federated Learning, Trust Coordination, Adaptive Thresholding, Autonomous Decision Making, Industrial Networks, Agentic Control, Resilient Systems, Lightweight Framework

70. ❌ Macroscopic Characteristics of Mixed Traffic Flow with Deep Reinforcement Learning Based Automated and Human-Driven Vehicles

Authors: Pankaj Kumar, Pranamesh Chakraborty, Subrahmanya Swamy Peruru Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

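Scoring rationale: The paper studies deep reinforcement learning (DRL) control of automated vehicles in mixed traffic, focusing on macroscopic traffic flow characteristics and fuel efficiency. All scoring keywords concern large language models (LLMs) and related techniques (such as MoE, SFT, RAG, and quantization), whereas the paper involves no LLM techniques and uses only conventional deep reinforcement learning (the TD3 algorithm) and a traffic dataset (NGSIM). Every keyword therefore has relevance 0.

!!! tip deepseek-chat TL;DR

The paper studies the macroscopic traffic characteristics of DRL-controlled automated vehicles in mixed traffic. Moving from fully human-driven to fully RL-controlled traffic raises road capacity by about 7.52%, and, compared with the IDM, RL-based vehicles improve average fuel efficiency by about 28.98% above 50 km/h and by about 1.86% below 50 km/h.

Abstract translation (Chinese)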

在混合交通流（即自动驾驶车辆（Automated Vehicle, AV）与人工驾驶车辆共存）中，自动驾驶车辆的控制面临着重大挑战：需在捕捉异质驾驶员行为的同时，平衡安全性、效率、舒适性、燃油效率及交通规则遵守。传统的跟驰模型，如智能驾驶员模型（Intelligent Driver Model, IDM），往往难以泛化至多样化的交通场景，且通常未考虑燃油效率，这促使了基于学习的方法的应用。尽管深度强化学习（Deep Reinforcement Learning, DRL）在微观跟驰场景中已表现出强大性能，但其宏观交通流特性仍未得到充分探索。本研究重点分析了基于DRL的模型在混合交通中的宏观交通流特性与燃油效率。研究采用双延迟深度确定性策略梯度（Twin Delayed Deep Deterministic Policy Gradient, TD3）算法控制自动驾驶车辆，并使用NGSIM高速公路数据集进行训练，以实现与人工驾驶车辆的真实交互。交通性能通过基本图（Fundamental Diagram, FD）在不同驾驶员异质性、异质安全时距渗透水平以及不同比例强化学习控制车辆的条件下进行评估。同时，在宏观层面对基于强化学习的自动驾驶模型与IDM的燃油效率进行了比较。结果表明，交通性能对安全时距的分布以及强化学习车辆的比例较为敏感。从完全人工驾驶交通过渡到完全由强化学习控制的交通，可使道路通行能力提升约7.52%。此外，与IDM相比，基于强化学习的自动驾驶车辆在较高速度（高于50 km/h）下平均燃油效率提升约28.98%，在较低速度（低于50 km/h）下提升约1.86%。总体而言，该深度强化学习框架在不牺牲安全性的前提下，提升了道路通行能力与燃油效率。

Abstract

Automated Vehicle (AV) control in mixed traffic, where AVs coexist with human-driven vehicles, poses significant challenges in balancing safety, efficiency, comfort, fuel efficiency, and compliance with traffic rules while capturing heterogeneous driver behavior. Traditional car-following models, such as the Intelligent Driver Model (IDM), often struggle to generalize across diverse traffic scenarios and typically do not account for fuel efficiency, motivating the use of learning-based approaches. Although Deep Reinforcement Learning (DRL) has shown strong microscopic performance in car-following conditions, its macroscopic traffic flow characteristics remain underexplored. This study focuses on analyzing the macroscopic traffic flow characteristics and fuel efficiency of DRL-based models in mixed traffic. A Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is implemented for AVs’ control and trained using the NGSIM highway dataset, enabling realistic interaction with human-driven vehicles. Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles. A macroscopic level comparison of fuel efficiency between the RL-based AV model and the IDM is also conducted. Results show that traffic performance is sensitive to the distribution of safe time gaps and the proportion of RL vehicles. Transitioning from fully human-driven to fully RL-controlled traffic can increase road capacity by approximately 7.52%. Further, RL-based AVs also improve average fuel efficiency by about 28.98% at higher speeds (above 50 km/h), and by 1.86% at lower speeds (below 50 km/h) compared to the IDM. Overall, the DRL framework enhances traffic capacity and fuel efficiency without compromising safety.

Keywords: Mixed Traffic Flow, Deep Reinforcement Learning, Automated Vehicles, Macroscopic Characteristics, Fuel Efficiency, Twin Delayed Deep Deterministic Policy Gradient, Fundamental Diagram, Traffic Capacity
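
For readers unfamiliar with the baseline, the Intelligent Driver Model named in the abstract computes a follower's acceleration from its speed, the gap to the leader, and the approach rate. The sketch below uses the standard textbook form of the IDM; the parameter values are illustrative defaults assumed here, not the calibration used in the paper.

```python
import math

def idm_acceleration(v, gap, dv, v0=33.3, T=1.5, s0=2.0, a_max=1.0, b=2.0, delta=4):
    """Intelligent Driver Model acceleration in m/s^2.

    v: follower speed (m/s); gap: bumper-to-bumper distance to the leader (m);
    dv: approach rate v - v_leader (m/s). v0 is the desired speed, T the desired
    time gap, s0 the minimum gap, a_max the maximum acceleration, b the
    comfortable deceleration. All defaults are assumed, illustrative values.
    """
    # Desired dynamic gap: standstill distance plus time-gap and braking terms.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# Closing quickly on a nearby leader produces strong braking.
braking = idm_acceleration(v=20.0, gap=5.0, dv=5.0)
```

The time gap `T` is the quantity whose distribution the abstract reports traffic performance to be sensitive to.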

71. ❌ Evaluating Language Models for Harmful Manipulation

Authors: Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

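Scoring rationale: The paper presents a framework for evaluating harmful manipulation by AI models, language models in particular. Its core is the behavioural evaluation of large language models (LLMs) in realistic settings, so it is highly relevant to "Large Language Models" (10 points). Its focus on how AI models induce changes in human beliefs and behaviour is fairly relevant to "Instruction Tuning OR Alignment OR Value Alignment" (8 points), because it touches on the alignment of model behaviour and on ethical considerations. The remaining keywords, such as MoE, SLMs, training techniques, inference optimization, and AI-for-science applications, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a framework for evaluating harmful manipulation by AI models. In a large-scale human-AI interaction study with 10,101 participants across three domains (public policy, finance, and health) and three locales (US, UK, and India), the tested model produced manipulative behaviours when prompted to do so and induced belief and behaviour changes in participants, with effects that varied by domain and geography.

Abstract translation (Chinese)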

学界对人工智能驱动的有害操纵行为的关注日益增长，但现有评估方法存在局限。本文提出一种通过特定情境人机交互研究评估有害人工智能操纵的框架。为展示该框架的实用性，我们组织10,101名参与者，在三个AI应用领域（公共政策、金融和健康）及三个地区（美国、英国和印度）对某AI模型进行交互测试。总体而言，研究发现：当被诱导时，受测模型能够产生操纵性行为，并在实验环境中成功引发参与者的信念与行为改变。我们进一步发现情境因素至关重要：不同领域中AI的操纵表现存在差异，这表明评估需在AI系统可能应用的高风险情境中进行。研究还发现不同地理区域存在显著差异，意味着某一地区的AI操纵研究结论可能无法直接推广至其他地区。最后，我们发现AI模型产生操纵行为的频率（倾向性）并不能稳定预测操纵成功的可能性（有效性），这强调了对这两个维度进行独立研究的重要性。为促进评估框架的应用，我们详细说明了测试方案并公开相关材料。文末探讨了评估AI模型有害操纵行为面临的开放性挑战。

Abstract

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

Keywords: harmful manipulation, AI evaluation framework, human-AI interaction, belief change, behavior change, context-specific evaluation, geographic differences, manipulative efficacy

72. ❌ How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25325v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

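Scoring rationale: The paper's core subject is weight pruning of large language models (LLMs) and its effect on internal feature representations, so it is highly relevant to "Large Language Models" (10 points). Pruning is a form of model compression, directly related to "Quantization OR Model Compression" (10 points). The study uses sparse autoencoders (SAEs) as interpretability probes to analyze feature geometry, which is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). Pruning produces sparse models, giving a moderate link to "Mixture of Experts OR MoE OR Sparse Models" (8 points). Model compression may indirectly support on-device AI, a weak link to "Small Language Models OR SLMs OR On-device AI" (5 points). Pruning may affect inference efficiency, a weak link to "Speculative Decoding OR Inference Acceleration" (5 points). Other keywords, such as pre-training, alignment, RAG, reasoning methods, agent systems, and AI for science, are not covered in the paper and are scored 0.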
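
The score table above can be read as simple arithmetic. The sketch below is a hypothetical reconstruction, not the report generator's actual code: it assumes a row's score is weight × relevance and a paper's score is the sum over rows, and it takes the pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) from the report header. The function names are illustrative only, and the header's "maximum score per keyword" setting is not modelled.

```python
# Hypothetical reconstruction of the report's scoring arithmetic.
# Assumption: row score = weight * relevance, paper score = sum of row scores.

def pass_threshold(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance values from this paper's table.
relevances = [10.0, 8.0, 5.0, 10.0, 5.0, 10.0]

# The table lists a weight of 0.0 on every row.
rows_as_reported = [(0.0, r) for r in relevances]
# Purely illustrative: the same rows with the weight of 1.0 listed in the header.
rows_with_header_weight = [(1.0, r) for r in relevances]

threshold = pass_threshold(27.0)
print(f"pass mark: {threshold:.1f}")
print(f"score with table weights: {paper_score(rows_as_reported):.1f}")
print(f"score with header weights: {paper_score(rows_with_header_weight):.1f}")
```

Under these assumptions, a zero weight on every row yields the 0.0 total shown in the table regardless of the relevance values.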

!!! tip deepseek-chat TL;DR

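The paper studies how weight pruning reshapes the internal feature geometry of large language models. It finds that pruning preferentially destroys high-frequency generic features while preserving rare specialized ones, and that Wanda pruning preserves feature structure better than magnitude pruning.

Abstract (Chinese translation)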

权重剪枝是压缩大语言模型的常用技术，但其对已学习内部表征的影响仍不甚明晰。本文首次系统研究了非结构化剪枝如何重塑语言模型的特征几何结构，并以稀疏自编码器作为可解释性探针展开分析。我们在三种模型系列（Gemma 3 1B、Gemma 2 2B、Llama 3.2 1B）、两种剪枝方法（幅度剪枝与Wanda剪枝）和六个稀疏度层级（0-60%）的实验框架下，围绕种子稳定性、特征存活性、SAE可迁移性、特征脆弱性与因果相关性五个研究问题展开探究。最显著的发现是：稀疏自编码器中的稀有特征（即低激活频率特征）在剪枝过程中的存续能力远高于高频特征——在17组实验条件中有11组出现了条件内斯皮尔曼相关系数ρ=-1.0的极端负相关。这一反直觉的结果表明剪枝过程实质是隐式特征选择，优先破坏高频通用特征而保留专业化的稀有特征。我们进一步发现：Wanda剪枝保留特征结构的能力最高可达幅度剪枝的3.7倍；预训练的SAE在稀疏度达50%的Wanda剪枝模型上仍保持有效性；几何特征存活性并不能预测因果重要性——这种分离现象对压缩场景下的可解释性研究具有重要启示。

Abstract

Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0–60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features–those with low firing rates–survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance–a dissociation with implications for interpretability under compression.

Keywords: weight pruning, large language models, sparse autoencoders, feature geometry, model compression, interpretability, sparsity, feature survival
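
The abstract reports within-condition Spearman correlations of rho = -1.0 between feature firing rate and survival. As a minimal sketch of what that value means, the code below computes Spearman's rho in plain Python as the Pearson correlation of ranks. The firing rates and survival fractions are made-up illustrative numbers, not data from the paper.

```python
def ranks(values):
    """Return 1-based average ranks, giving tied values the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical firing-rate bins and the fraction of features surviving pruning.
firing_rate = [0.001, 0.005, 0.02, 0.08, 0.30]
survival = [0.95, 0.90, 0.70, 0.55, 0.20]

rho = spearman(firing_rate, survival)
print(rho)
```

A value of -1.0 arises whenever survival strictly decreases as firing rate increases, whatever the size of the gaps between bins; it describes the ordering, not the magnitude of the effect.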

73. ❌ DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers

Authors: Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25293v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

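Scoring rationale: DAGverse focuses on building document-grounded semantic directed acyclic graphs (DAGs) from scientific papers, which involves document understanding, graph structure extraction, and evidence collection. Its core is the semi-automatic DAGverse-Pipeline system, comprising figure classification, graph reconstruction, semantic grounding, and validation. The paper is an application of AI to science (AI for Science) because it processes scientific papers and builds structured knowledge representations, but it does not directly contribute to the principles of large models or deep learning techniques. The Vision-Language Models it mentions serve only as baselines, not as the research focus. Therefore, apart from "AI for Science OR Bioinformatics OR Cheminformatics" (5 points, a moderate link as an application of AI to science), all other keywords (LLMs, MoE, Scaling Laws, Fine-tuning, RAG, Reasoning, Agents, and so on) are unrelated to the paper and are scored 0. The weighted total comes solely from the 5.0 points for this keyword.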

!!! tip deepseek-chat TL;DR

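The paper studies the problem of recovering semantic directed acyclic graphs (DAGs) and their supporting evidence from scientific papers. It proposes the DAGverse framework and the semi-automatic DAGverse-Pipeline system, and releases the DAGverse-1 dataset. Experiments show that the pipeline outperforms existing Vision-Language Models on DAG classification and annotation.

Abstract (Chinese translation)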

有向无环图（Directed Acyclic Graphs，DAGs）被广泛用于表示科学与技术领域中的结构化知识。然而，现实世界中的DAG数据集仍然稀缺，因为构建这些数据集通常需要专家对领域文档进行解读。我们研究Doc2SemDAG构建任务：从文档中恢复出首选的语义DAG，并同时提取解释该DAG的引用证据与上下文。这一问题具有挑战性，因为同一文档可能允许存在多种合理的抽象表达，预期结构往往隐含于文中，且支撑证据分散在正文、公式、图注和图表之中。为应对这些挑战，我们利用包含显式DAG图示的科学论文作为自然的监督来源。在此设定下，DAG图示提供了DAG结构，而伴随文本则提供了上下文与解释。我们提出了DAGverse框架，用于从在线科学论文中构建基于文档的语义DAG。其核心组件DAGverse-Pipeline是一个半自动化系统，旨在通过图示分类、图重构、语义关联和验证等步骤生成高精度的语义DAG实例。作为案例研究，我们针对因果DAG测试了该框架，并发布了DAGverse-1数据集，其中包含108个经过专家验证的语义DAG，并提供了图级、节点级和边级的证据。实验表明，在DAG分类与标注任务上，DAGverse-Pipeline的表现优于现有的视觉-语言模型。DAGverse为基于文档的DAG基准测试奠定了基础，并为研究基于现实世界证据的结构化推理开辟了新的方向。

Abstract

Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.

Keywords: Directed Acyclic Graphs, semantic DAGs, document-grounded, scientific papers, DAGverse, DAGverse-Pipeline, Vision-Language Models, structured reasoning

74. ❌ Revealing the influence of participant failures on model quality in cross-silo Federated Learning

Authors: Fabian Stricker, David Bermbach, Christian Zirpins Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25289v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

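Scoring rationale: The paper studies how participant failures affect model quality in federated learning, which is reliability research on distributed machine learning systems. It is unrelated to every scoring keyword (all of which focus on large-model principles, training methods, inference optimization, applications, and so on), so the relevance of every keyword is 0.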

!!! tip deepseek-chat TL;DR

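The paper systematically studies how missing participants affect model performance in federated learning. It finds that data skewness is a key factor that can lead to overly optimistic model evaluations and alter the effects of other factors.

Abstract (Chinese translation)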

联邦学习（Federated Learning, FL）是一种在协作环境中训练机器学习（ML）模型的范式，其通过将原始数据保留在本地来保护参与者的隐私。将联邦学习应用于实际生产环境的一个关键要求是可靠性，因为可靠性不足可能损害学习结果的有效性、稳定性和可复现性。联邦学习本质上作为一个分布式系统运行，因此容易受到崩溃故障、网络分区及其他故障场景的影响。尽管如此，此类故障对联邦学习结果的影响尚未得到系统研究。
本文通过探究联邦学习中参与者缺失的影响来填补这一空白。为此，我们在图像、表格和时间序列数据上进行了大量实验，并分析了参与者缺席如何影响模型性能，同时考虑了数据偏斜度、不同的可用性模式以及模型架构等影响因素。此外，我们还考察了特定场景下的问题，包括全局模型对缺失参与者的效用。我们的实验详细揭示了各种影响因素的作用效果。特别地，我们表明数据偏斜度具有强烈影响，常常导致模型评估过于乐观，在某些情况下甚至改变其他影响因素的作用效果。

Abstract

Federated Learning (FL) is a paradigm for training machine learning (ML) models in collaborative settings while preserving participants’ privacy by keeping raw data local. A key requirement for the use of FL in production is reliability, as insufficient reliability can compromise the validity, stability, and reproducibility of learning outcomes. FL inherently operates as a distributed system and is therefore susceptible to crash failures, network partitioning, and other fault scenarios. Despite this, the impact of such failures on FL outcomes has not yet been studied systematically. In this paper, we address this gap by investigating the impact of missing participants in FL. To this end, we conduct extensive experiments on image, tabular, and time-series data and analyze how the absence of participants affects model performance, taking into account influencing factors such as data skewness, different availability patterns, and model architectures. Furthermore, we examine scenario-specific aspects, including the utility of the global model for missing participants. Our experiments provide detailed insights into the effects of various influencing factors. In particular, we show that data skewness has a strong impact, often leading to overly optimistic model evaluations and, in some cases, even altering the effects of other influencing factors.

Keywords: Federated Learning, participant failures, model quality, data skewness, availability patterns, model performance, distributed system, reliability
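
To make the interaction between missing participants and data skewness concrete, the sketch below assumes a FedAvg-style sample-weighted average of client parameters. The abstract does not name the aggregation rule, so that choice is an assumption, and the client names, sample counts, and parameter vectors are all hypothetical.

```python
def fedavg(updates):
    """Sample-weighted average of client parameter vectors.

    updates: list of (num_samples, parameter_vector) pairs.
    """
    total_samples = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    return [
        sum(n * params[d] for n, params in updates) / total_samples
        for d in range(dim)
    ]


# Hypothetical clients: "a" and "b" hold similar data, "c" holds skewed data.
clients = {
    "a": (100, [1.0, 0.0]),
    "b": (100, [1.0, 0.2]),
    "c": (100, [4.0, 3.0]),
}

# Global model when every participant reports in.
full_round = fedavg(list(clients.values()))
# Global model when the skewed participant "c" has failed for this round.
round_without_c = fedavg([u for name, u in clients.items() if name != "c"])

print(full_round)
print(round_without_c)
```

In this toy setting, losing the skewed client moves the global parameters much further than losing "a" or "b" would, which is one way to picture why skewness can dominate the other factors the paper examines.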

75. ❌ CSI-tuples-based 3D Channel Fingerprints Construction Assisted by MultiModal Learning

Authors: Chenjie Xie, Li You, Ruirong Chen, Gaoning He, Xiqi Gao Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25288v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

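Scoring rationale: The paper focuses on constructing 3D channel fingerprints for wireless communications and proposes a modular framework based on multimodal learning. All keywords concern the principles of large models and deep learning or their applications in science, but the paper does not involve large language models, training optimization of deep learning models, inference acceleration, alignment, agents, or any related technique. The only possible connection is "AI for Science", because the paper applies AI to a communication science problem (channel estimation in 6G mobile communications); however, it does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (a moderate link). All other keywords are unrelated to the paper's core content of wireless communications, multimodal regression, and channel modelling.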

!!! tip deepseek-chat TL;DR

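The paper proposes a modular framework based on CSI-tuples and multimodal learning for constructing 3D channel fingerprints, aiming to make channel state information acquisition in low-altitude communications more efficient. Experiments show at least 27.5% higher accuracy than existing algorithms.

Abstract (Chinese translation)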

低空通信能够促进空天地无线资源融合、拓展网络覆盖范围并提升传输质量，从而赋能第六代移动通信系统的发展。作为低空传输的关键使能技术，三维信道指纹（3D-CF，亦称三维无线电地图或三维信道知识图谱）有望深化对通信环境的认知，辅助信道状态信息获取，避免重复估计并降低计算复杂度。本文提出一种模块化多模态框架以构建三维信道指纹。具体而言，我们首先基于莱斯衰落信道将三维信道指纹模型建立为信道状态信息元组的集合，其中每个元组包含低空飞行器的位置信息及其对应的统计信道状态信息。考虑到不同先验数据的异构结构，我们将三维信道指纹构建问题建模为多模态回归任务，使得信道状态信息元组中的目标信道信息能够直接通过对应的低空飞行器位置、通信测量数据及地理环境地图进行联合估计。随后，本文相应提出一种高效多模态框架，其包含基于相关性的多模态融合模块、多模态表征模块以及信道状态信息回归模块。数值结果表明，所提框架能够高效构建三维信道指纹，在不同通信场景下相比现有最优算法至少提升27.5%的精度，展现出优越的性能与出色的泛化能力。我们同时分析了计算复杂度，并论证了其在推理时间方面的显著优势。

Abstract

Low-altitude communications can promote the integration of aerial and terrestrial wireless resources, expand network coverage, and enhance transmission quality, thereby empowering the development of sixth-generation (6G) mobile communications. As an enabler for low-altitude transmission, 3D channel fingerprints (3D-CF), also referred to as the 3D radio map or 3D channel knowledge map, are expected to enhance the understanding of communication environments and assist in the acquisition of channel state information (CSI), thereby avoiding repeated estimations and reducing computational complexity. In this paper, we propose a modularized multimodal framework to construct 3D-CF. Specifically, we first establish the 3D-CF model as a collection of CSI-tuples based on Rician fading channels, with each tuple comprising the low-altitude vehicle’s (LAV) positions and its corresponding statistical CSI. In consideration of the heterogeneous structures of different prior data, we formulate the 3D-CF construction problem as a multimodal regression task, where the target channel information in the CSI-tuple can be estimated directly by its corresponding LAV positions, together with communication measurements and geographic environment maps. Then, a high-efficiency multimodal framework is proposed accordingly, which includes a correlation-based multimodal fusion (Corr-MMF) module, a multimodal representation (MMR) module, and a CSI regression (CSI-R) module. Numerical results show that our proposed framework can efficiently construct 3D-CF and achieve at least 27.5% higher accuracy than the state-of-the-art algorithms under different communication scenarios, demonstrating its competitive performance and excellent generalization ability. We also analyze the computational complexity and illustrate its superiority in terms of the inference time.

Keywords: 3D channel fingerprints, CSI-tuples, multimodal learning, low-altitude communications, 6G mobile communications, channel state information, multimodal regression, wireless communication

76. ❌ SliderQuant: Accurate Post-Training Quantization for LLMs

Authors: Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25284v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

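Scoring rationale: The paper's core subject is post-training quantization (PTQ) for LLMs, so it is highly relevant to "Large Language Models" and "Post-training" (10 points each). It proposes a new quantization framework, SliderQuant, which is a model compression technique and highly relevant to "Quantization" (10 points). The experiments include MoE models, giving a moderate link to "Mixture of Experts" (5 points). Quantization aims to improve inference efficiency, giving a moderate link to "Speculative Decoding" (5 points). Other keywords, such as SLMs, Scaling Laws, Pre-training, and Instruction Tuning, are unrelated to the paper (0 points).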

!!! tip deepseek-chat TL;DR

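The paper proposes the SliderQuant framework, which uses an adaptive sliding quantization design to address differences in quantization sensitivity across layers in LLM post-training quantization, and outperforms existing PTQ methods across a range of tasks and models.

Abstract (Chinese translation)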

本文从被忽视的视角探讨大语言模型（LLMs）的后训练量化（PTQ）：给定一个预训练的高精度大语言模型，主流的顺序量化框架对所有层进行同等处理，但在低比特位宽等挑战性设定下，这可能并非最优。我们通过实证研究量化对不同层模型精度的影响，并观察到：（1）浅层/深层通常比中间层对量化更敏感；（2）在浅层/深层中，最敏感的是第一层/最后一层，其量化误差显著大于其他层。这些实证观察表明，大语言模型不同层的量化设计需要在多个层面进行，而非采用所有层共享的单一方案。受此启发，我们提出一种新的后训练量化框架——滑动层量化（SliderQuant），该框架基于一种由少量可学习参数实现的简单自适应滑动量化思想。SliderQuant的基础组件称为层间滑动量化，它包含三种新颖的滑动窗口设计，专门用于应对浅层、中间层和深层不同的量化敏感性。另一组件称为层内滑动量化，采用增量策略对每个窗口进行量化。因此，SliderQuant具备强大的跨层降低量化误差的能力。我们在基础语言生成、零样本常识推理以及具有挑战性的数学和代码任务上进行了广泛实验，涵盖多种大语言模型，包括Llama/Llama2/Llama3/Qwen2.5模型系列、DeepSeek-R1蒸馏模型及大型混合专家（MoE）模型。实验结果表明，无论是仅权重量化还是权重-激活量化，我们的方法均优于现有的后训练量化方法（包括采用旋转变换的最新后训练量化方法）。

Abstract

In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared to all layers. Motivated by this, we propose a new PTQ framework termed Sliding-layer Quantization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by few learnable parameters. The base component of SliderQuant is called inter-layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra-layer sliding quantization that leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero-shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek-R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight-only quantization and weight-activation quantization.

Keywords: post-training quantization, large language models, model compression, quantization sensitivity, sliding quantization, weight-only quantization, weight-activation quantization, MoE models
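
The abstract singles out low bit-width settings as challenging. The sketch below illustrates why, using plain symmetric round-to-nearest weight quantization; it is not SliderQuant and involves no sliding windows or learnable parameters. The weight values and the function name `quantize_rtn` are hypothetical.

```python
def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns the dequantized values so they can be compared with the originals.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax     # step size between levels
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights standing in for one layer of a model.
weights = [0.42, -0.17, 0.05, -0.88, 0.31, 0.66, -0.02, 0.09]

errors = {bits: mse(weights, quantize_rtn(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.6f}")
```

Each bit removed roughly doubles the step size, so the reconstruction error grows quickly as the bit-width drops. A layer whose weights are harder to represent on a coarse grid will show this growth more sharply, which is the kind of per-layer difference the paper sets out to handle.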

77. ❌ A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion

Authors: Adam Gabet, Sarah Kohn, Guy Lutsker, Shira Gelman, Anastasia Godneva, Gil Sasson, Arad Zulti, David Krongauz, Rotem Shaulitch, Assaf Rotem, Ohad Doron, Yuval Brodsky, Adina Weinberger, Eran Segal Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25283v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

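Scoring rationale: The paper develops a "foundation model" for gait analysis, but this is a domain-specific foundation model in computer vision and biomedical engineering, not a large language model (LLM) from natural language processing. Its content focuses entirely on gait analysis, health phenotype prediction, and biomedical applications, and is unrelated to all LLM-related keywords (MoE, SFT, RLHF, RAG, inference acceleration, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to biomedical science (in particular bioinformatics and phenotype analysis); it is scored 10 points (highly relevant, core content).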

!!! tip deepseek-chat TL;DR

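The study develops a gait foundation model based on 3D skeletal motion that predicts a range of health phenotypes from gait, showing that gait can serve as an independent whole-body biosignal.

Abstract (Chinese translation)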

步态日益被视为一项生命体征，但现有研究方法多将其视为特定病理的症状而非系统性生物标志物。我们基于3,414名深度表型分析成年人在执行五项运动任务时通过深度相机记录的3D骨骼运动数据，开发了一个步态基础模型。该模型学习到的嵌入表征优于人工设计的特征，能有效预测年龄（皮尔逊相关系数r=0.69）、身体质量指数（BMI，r=0.90）和内脏脂肪组织面积（VAT，r=0.82）。这些嵌入表征对3,210个表型目标中的1,980个具有显著预测力；在调整年龄、BMI、VAT和身高后，步态数据为男性全部18个身体系统、女性18个系统中的17个提供了独立增益，并提升了对临床诊断和药物使用的预测能力。解剖学消融分析显示，腿部运动主导代谢与衰弱预测，而躯干运动编码睡眠及生活方式表型。这些发现确立了步态作为一种独立的多系统生物信号，推动其向消费级视频技术转化，并作为可扩展、无创的生命体征融入健康监测体系。

Abstract

Gait is increasingly recognized as a vital sign, yet current approaches treat it as a symptom of specific pathologies rather than a systemic biomarker. We developed a gait foundation model for 3D skeletal motion from 3,414 deeply phenotyped adults, recorded via a depth camera during five motor tasks. Learned embeddings outperformed engineered features, predicting age (Pearson r = 0.69), BMI (r = 0.90), and visceral adipose tissue area (r = 0.82). Embeddings significantly predicted 1,980 of 3,210 phenotypic targets; after adjustment for age, BMI, VAT, and height, gait provided independent gains in all 18 body systems in males and 17 of 18 in females, and improved prediction of clinical diagnoses and medication use. Anatomical ablation revealed that legs dominated metabolic and frailty predictions while torso encoded sleep and lifestyle phenotypes. These findings establish gait as an independent multi-system biosignal, motivating translation to consumer-grade video and its integration as a scalable, passive vital sign.

Keywords: gait foundation model, 3D skeletal motion, health phenotypes, multi-system biomarker, deep phenotyping, visceral adipose tissue, motor tasks, vital sign

78. ❌ Distribution and Clusters Approximations as Abstract Domains in Probabilistic Abstract Interpretation to Neural Network Analysis

Authors: Zhuofan Zhang, Herbert Wiklicky Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25273v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

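Scoring rationale: The paper studies an abstract interpretation framework for neural network analysis, focusing on probabilistic abstract interpretation and abstract domain methods. It is unrelated to all the given keywords on large models, deep learning principles, and AI for science applications. Its content concerns theoretical methods for neural network analysis rather than large-model techniques, training methods, inference optimization, alignment, or applications.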

!!! tip deepseek-chat TL;DR

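The paper proposes two new abstract domains (distribution approximation and clusters approximation) for the abstract interpretation framework of neural network analysis, which analyzes the density distribution flow of all possible inputs through a neural network.

Abstract (Chinese translation)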

概率抽象解释框架下的神经网络分析通过分析所有可能输入的密度分布流来解析神经网络。网格近似是该框架采用的抽象域之一，它将具体空间抽象为网格。本文介绍了两种新颖的近似方法：分布近似与聚类近似。我们通过若干简单示例的图示，结合相应的抽象转换器，从理论上阐释了这两种方法的运作机制。

摘要 (Abstract)

The probabilistic abstract interpretation framework of neural network analysis analyzes a neural network by analyzing its density distribution flow of all possible inputs. The grids approximation is one of abstract domains the framework uses which abstracts concrete space into grids. In this paper, we introduce two novel approximation methods: distribution approximation and clusters approximation. We show how these two methods work in theory with corresponding abstract transformers with help of illustrations of some simple examples.

Keywords: probabilistic abstract interpretation, neural network analysis, abstract domains, distribution approximation, clusters approximation, density distribution flow, abstract transformers

79. ❌ CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Authors: Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

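Scoring rationale: CRAFT evaluates large language models on multi-agent coordination tasks, specifically pragmatic communication under partial information. It is therefore highly relevant (10 points) to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems', which are the paper's core subject. The reasoning keywords ('Chain of Thought' and 'System 2 Thinking') are partly relevant (5 points): the paper examines how reasoning ability relates to coordination performance but does not study these reasoning techniques themselves. The remaining keywords, covering model architecture, training methods, optimization techniques, and scientific applications, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how well large language models coordinate as multiple agents through natural language under strict partial information. It finds that stronger reasoning ability does not always translate into better coordination and that smaller open-weight models sometimes match or outperform frontier systems, suggesting that multi-agent coordination remains a fundamentally unsolved challenge for current language models.

Abstract (translation)

We introduce CRAFT, a multi-agent benchmark for evaluating the pragmatic communication ability of large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must collaborate through natural language to build a shared 3D structure that no single agent can fully observe. We formalize the problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, including a taxonomy of behavioral failure profiles for both frontier and open-weight models. Testing a diverse set of models (8 open-weight models and 7 frontier models, some of them reasoning models), we find that stronger reasoning ability does not reliably translate into better collaboration: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code is available at https://github.com/csu-signal/CRAFT.

Abstract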

We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT

Keywords: multi-agent coordination, large language models, partial information, pragmatic communication, benchmark evaluation, spatial grounding, belief modeling, open-weight models

80. ❌ Probabilistic Abstract Interpretation on Neural Networks via Grids Approximation

Authors: Zhuofan Zhang, Herbert Wiklicky Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25266v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

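Scoring rationale: The paper applies probabilistic abstract interpretation to neural networks in order to analyze the probability distribution flow of their inputs. It is unrelated to most keywords (LLM, MoE, RLHF, RAG, and so on), which concern specific architectures, training methods, and application techniques of large models, whereas this paper is a general theoretical analysis of neural networks. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI': the abstract interpretation framework aims to understand and explain neural network behavior and so falls under explainable AI, but large-model interpretability is not its core focus, hence 5 points (partly relevant).

!!! tip deepseek-chat TL;DR

The paper applies probabilistic abstract interpretation to neural networks to analyze the probability distribution flow of all possible inputs when a network has infinitely many or uncountably many inputs, and shows how the framework's abstract domains and abstract transformers help analyze real-world problems.

Abstract (translation)

Probabilistic abstract interpretation is a theory for extracting particular properties of a computer program when testing every individual input is infeasible. This paper applies the theory to neural networks for the same purpose: to analyze the density distribution flow of all possible inputs when a network has uncountably many, or countably but infinitely many, inputs. We show how the theoretical framework works in neural networks, then discuss the different abstract domains used in the framework, the corresponding Moore-Penrose pseudo-inverses, and the abstract transformers. We also present experimental examples showing how the framework helps analyze real-world problems.

Abstract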

Probabilistic abstract interpretation is a theory used to extract particular properties of a computer program when it is infeasible to test every single inputs. In this paper we apply the theory on neural networks for the same purpose: to analyse density distribution flow of all possible inputs of a neural network when a network has uncountably many or countable but infinitely many inputs. We show how this theoretical framework works in neural networks and then discuss different abstract domains and corresponding Moore-Penrose pseudo-inverses together with abstract transformers used in the framework. We also present experimental examples to show how this framework helps to analyse real world problems.

Keywords: Probabilistic Abstract Interpretation, Neural Networks, Density Distribution Flow, Abstract Domains, Moore-Penrose Pseudo-inverses, Abstract Transformers, Input Analysis
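
To make the grid idea in this entry concrete, here is a minimal sketch of pushing a grid-discretized input distribution through a single function. It is not the paper's construction, which the abstract describes in terms of abstract domains, Moore-Penrose pseudo-inverses, and abstract transformers. The function name `push_grid_through`, the choice to represent each cell by its center, and the ReLU example are all assumptions made for illustration.

```python
def push_grid_through(probs, f, lo, hi):
    """Push per-cell probability mass through f and re-bin it on the same grid.

    probs: probability mass of each equal-width cell covering [lo, hi].
    Each cell is represented by its center (a crude, assumed simplification).
    """
    n = len(probs)
    width = (hi - lo) / n
    out = [0.0] * n
    for i, p in enumerate(probs):
        center = lo + (i + 0.5) * width
        y = f(center)
        # Index of the output cell containing y, clamped to the grid.
        j = min(n - 1, max(0, int((y - lo) / width)))
        out[j] += p
    return out


def relu(x):
    return max(0.0, x)


# Uniform input mass over four cells on [-1, 1].
uniform = [0.25, 0.25, 0.25, 0.25]
result = push_grid_through(uniform, relu, -1.0, 1.0)
print(result)
```

In this toy run the two negative cells collapse onto the cell containing zero, so total mass is preserved while the distribution shifts toward the non-negative cells.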

81. ❌ MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Authors: Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv, Bing Zhao, Wei Hu Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

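Scoring rationale: The paper's core topic is the application of LLMs to science, specifically evaluating reasoning ability in chemical structure elucidation. Highly relevant keywords: LLMs (the paper explicitly studies the potential of LLMs in scientific discovery), AI for Science (direct application to chemistry), LLM Agents (it proposes an agent-based evaluation framework), and Chain of Thought and System 2 Thinking (it evaluates multi-step iterative reasoning and in-depth reasoning). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

Addressing the lack of evaluation of LLMs' dynamic reasoning in complex scientific tasks, the paper proposes MolQuest, an agent-based evaluation framework for molecular structure elucidation. Experiments show that even state-of-the-art models reach only about 50% accuracy in authentic scientific scenarios, revealing a significant gap in LLMs' strategic scientific reasoning.

Abstract (translation)

Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic evaluation of their dynamic reasoning in real research settings remains limited. Current scientific benchmarks rely mainly on static, single-turn question answering, which is inadequate for measuring performance on complex scientific tasks that require multi-step iteration and experimental interaction. To fill this gap, we propose MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built on authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task that requires models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. The framework systematically evaluates LLMs' abductive reasoning and strategic decision-making within a vast and complex chemical space. Empirical results show that current frontier models have significant limitations in authentic scientific scenarios: notably, even state-of-the-art models reach only about 50% accuracy, while most other models remain below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight a critical gap in current LLMs' strategic scientific reasoning and set a clear direction for future work on AI that can actively participate in the scientific process.

Abstract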

Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs’ abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation, our findings highlight the critical gap in current LLMs’ strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.

Keywords: Large Language Models, scientific discovery, agent-based evaluation, molecular structure elucidation, abductive reasoning, multi-step iteration, chemical experimental data, strategic decision-making

82. ❌ Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

Authors: Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25251v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

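Scoring rationale: The paper studies the relationship between explanation correctness and human understanding in explainable AI (XAI) and belongs to XAI evaluation. It is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points), because its core is assessing how XAI explanation methods affect human understanding. The other keywords all concern specific techniques or domains such as large models, deep learning techniques, and scientific applications, whereas this paper focuses on general XAI evaluation methodology and involves no particular model architecture, training technique, reasoning method, application domain, or performance optimization; they therefore score 0.

!!! tip deepseek-chat TL;DR

Through a user study, the paper tests whether the correctness of XAI explanations directly affects human understanding of AI decisions. It finds that correctness of 70% and below harms understanding, that not every difference in correctness leads to a difference in understanding, and that even fully correct explanations do not guarantee understanding.

Abstract (translation)

Explainable AI (XAI) methods are usually evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is generally assumed to yield better human understanding, but this link has not been verified experimentally under controlled conditions. We ran a user study (N=200) with four correctness levels (100%, 85%, 70%, 55%) in a time series classification task in which participants could not rely on domain knowledge or visual intuition and had to predict the AI's decisions from the explanations (forward simulation). Correctness did affect understanding, but not at every level: compared with fully correct explanations, performance dropped at 70% and 55% correctness, while further degradation below 70% caused no additional loss. Lower correctness did not reduce overall performance uniformly; instead, it lowered the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, since only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with measured performance only when explanations were fully correct and participants had learned the decision pattern. These findings show that not all differences in functional correctness translate into differences in human understanding, underscoring the need to validate functional metrics against actual human performance.

Abstract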

Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model’s reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI’s decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.

Keywords: Explainable AI, XAI evaluation, explanation correctness, human understanding, user study, forward simulation, functional metrics, time series classification

83. ❌ Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

Authors: Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

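Scoring rationale: The paper focuses on OOD detection in computer vision and uses vision-language models (VLMs) as the base model, but its content is unrelated to the supplied keyword list, which mainly targets the principles, training methods, inference optimization, alignment, and applications of large language models. The paper does not touch on any LLM-related technique, training method, inference acceleration, alignment technique, or AI-for-science application.

!!! tip deepseek-chat TL;DR

The paper proposes TANL, a training-free and test-efficient OOD detection method that dynamically evaluates activation levels and selects distribution-adaptive activated negative labels, reducing FPR95 on the ImageNet benchmark from 17.5% to 9.8%.

Abstract (translation)

Out-of-distribution (OOD) detection aims to identify samples that deviate from the in-distribution (ID) classes. One popular approach introduces negative labels that are distant from the ID classes and detects OOD samples by their distance to these labels. However, such labels may be only weakly activated on OOD samples and thus fail to capture their characteristics. To address this, we propose Test-time Activated Negative Labels (TANL), which dynamically evaluates activation levels over the corpus dataset and mines candidate labels with high activation responses during testing. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to build a label activation metric. This metric uses historical test samples to align adaptively with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current test batch, we introduce a more fine-grained, batch-adaptive variant. To make full use of label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, improving performance and robustness to the number of labels. TANL requires no additional training, is efficient at test time, and is theoretically grounded. Experiments across diverse backbones and a wide range of task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL reduces FPR95 from 17.5% to 9.8%. Code is released at https://github.com/YBZh/OpenOOD-VLM.

Abstract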

Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at https://github.com/YBZh/OpenOOD-VLM.

Keywords: OOD detection, vision-language models, test-time activation, negative labels, distribution adaptation, activation-aware scoring, training-free method, ImageNet benchmark
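
The FPR95 figure quoted in this entry is the false positive rate on OOD samples at the threshold where 95% of ID samples are still accepted. The sketch below shows one common way to compute it. The function name `fpr_at_tpr` and the convention that a higher score means "more in-distribution" are assumptions for illustration, not details taken from the paper.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold keeping `tpr` of ID samples.

    Assumed convention: higher score = more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted to reach the target TPR.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    # An OOD sample at or above the threshold is a false positive.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy data: 20 ID scores (1..20) and 4 OOD scores.
toy_fpr = fpr_at_tpr(list(range(1, 21)), [0, 1, 1, 3])
print(toy_fpr)
```

With the toy scores, 19 of the 20 ID samples are kept, the threshold lands at score 2, and one of the four OOD samples exceeds it.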

84. ❌ FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

Authors: Taejin Jeong, Joohyeok Kim, Jinyeong Kim, Chanyoung Kim, Seong Jae Hwang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

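Scoring rationale: FEAST focuses on gene expression prediction in Spatial Transcriptomics, using an attention-based deep learning framework (fully connected graph attention with negative-aware attention) to address the structural limitations of existing graph neural networks. The paper's core is an AI application in bioinformatics, which falls under "AI for Science", so it is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). However, the paper does not involve large language models (LLMs), model training techniques (such as MoE, Scaling Laws, Pre-training, SFT, RLHF, PEFT), inference optimization (such as RAG, Context Window Extension, KV Cache Compression), reasoning methods (such as CoT, System 2 Thinking, MCTS), agent systems (such as LLM Agents, Tool Use), model efficiency (such as Quantization, Speculative Decoding), model reliability (such as Hallucination Mitigation), interpretability (such as Mechanistic Interpretability), world models, model merging, in-context learning, or the other keywords, so all of those score 0. The weighted total is computed as 8.0 (only one relevant keyword scores).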
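
The arithmetic behind a weighted total and the pass mark can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight, over 27 keywords of weight 1.0) comes from the report's configuration section. The per-row rule used here (weight × relevance, summed) is only an assumption inferred from the 8.0 total quoted in the rationale above; the report does not state it, and the helper names are hypothetical.

```python
PASS_BASE = 5.0        # from the report's pass-mark formula
PASS_PER_WEIGHT = 0.8  # from the report's pass-mark formula


def pass_threshold(weights):
    """Pass mark: 5.0 + 0.8 x total weight, rounded to one decimal."""
    return round(PASS_BASE + PASS_PER_WEIGHT * sum(weights), 1)


def weighted_total(rows):
    """Sum of weight x relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


threshold = pass_threshold([1.0] * 27)

# One keyword with relevance 8.0, the other 26 with relevance 0.0.
rows_configured = [(1.0, 8.0)] + [(1.0, 0.0)] * 26   # weights as configured
rows_as_tabulated = [(0.0, 8.0)] + [(0.0, 0.0)] * 26  # weights as shown in the table

print(threshold, weighted_total(rows_configured), weighted_total(rows_as_tabulated))
```

Under this assumed rule, a configured weight of 1.0 gives the 8.0 mentioned in the rationale, while the 0.0 weights shown in the score table give the reported 0.0; both fall below the 26.6 pass mark.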

!!! tip deepseek-chat TL;DR

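The paper proposes FEAST, a deep learning framework based on fully connected graph attention for predicting spatial transcriptomics gene expression from whole slide images. Its negative-aware attention and off-grid sampling strategy improve prediction accuracy and yield interpretable interaction maps.

Abstract (translation)

Spatial Transcriptomics (ST) provides spatially resolved gene expression data, offering key insights into tissue architecture and complex diseases. However, its high cost limits widespread adoption, so inferring spatial gene expression from readily available whole slide images has attracted great attention. Although graph neural networks have been proposed to model interactions between tissue regions, these methods rely on pre-defined sparse graphs and cannot consider potentially interacting spot pairs, which creates a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph and can therefore consider all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions and captures essential negative relationships that standard attention often overlooks. In addition, to mitigate the information loss caused by truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.

Abstract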

Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at https://github.com/starforTJ/FEAST.

Keywords: Spatial Transcriptomics, gene expression prediction, attention mechanism, fully connected graph, negative-aware attention, off-grid sampling, bioinformatics, deep learning

85. ❌ FluxEDA: A Unified Execution Infrastructure for Stateful Agentic EDA

Authors: Zhengrui Chen, Zixuan Song, Yu Li, Qi Sun, Cheng Zhuo Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25243v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

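Scoring rationale: The paper builds a state-preserving infrastructure layer for EDA (electronic design automation) to better support interaction between LLM-based autonomous agents and EDA tools. The core relevant keywords are: 1) 'LLM Agents/Autonomous Agents/Agentic Workflow' (10 points): the paper explicitly studies LLM agents in EDA automation; 2) 'Tool Use/Function Calling/API Tool Use' (10 points): the paper focuses on stateful interaction between agents and EDA tools; 3) 'Large Language Models/LLMs/Foundation Models' (8 points): the upper-layer agents are built on LLMs; 4) 'AI for Science/Bioinformatics/Cheminformatics' (8 points): EDA is treated as an AI for Science application in electronic design; 5) 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (5 points each): the paper supports multi-step analysis and optimization, which involves complex reasoning; 6) 'Multi-agent Systems/Agent Coordination' (5 points): the paper mentions coordinated iterative execution, which touches on agent coordination. The other keywords concern technical details such as model architecture, training methods, and optimization techniques that the paper does not address, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes FluxEDA, a unified stateful infrastructure layer that addresses the difficulty of preserving tool state and supporting iterative optimization when LLM agents interact with EDA tools. Two commercial backend case studies show that it supports multi-step analysis, state reuse, and coordinated execution.

Abstract (translation)

Large language models and autonomous agents are increasingly explored for electronic design automation, but many existing integrations still rely on script-level or request-level interaction, which makes it hard to preserve tool state and support iterative optimization in real production-oriented environments. This work presents FluxEDA, a unified and stateful infrastructure substrate for agentic EDA. FluxEDA provides a managed gateway-based execution interface with structured request and response handling, and it maintains persistent backend instances. Together, these features let upper-layer agents and programmable clients interact with heterogeneous EDA tools through preserved runtime state rather than through isolated shell invocations. We evaluate the framework with two representative commercial backend case studies: automated post-route timing ECO and standard-cell sub-library optimization. The results show that FluxEDA can support multi-step analysis and optimization over real tool contexts, including state reuse, rollback, and coordinated iterative execution. These findings suggest that a stateful and governed infrastructure layer is a practical foundation for agent-assisted EDA automation.

Abstract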

Large language models and autonomous agents are increasingly explored for EDA automation, but many existing integrations still rely on script-level or request-level interactions, which makes it difficult to preserve tool state and support iterative optimization in real production-oriented environments. In this work, we present FluxEDA, a unified and stateful infrastructure substrate for agentic EDA. FluxEDA introduces a managed gateway-based execution interface with structured request and response handling. It also maintains persistent backend instances. Together, these features allow upper-layer agents and programmable clients to interact with heterogeneous EDA tools through preserved runtime state, rather than through isolated shell invocations. We evaluate the framework using two representative commercial backend case studies: automated post-route timing ECO and standard-cell sub-library optimization. The results show that FluxEDA can support multi-step analysis and optimization over real tool contexts, including state reuse, rollback, and coordinated iterative execution. These findings suggest that a stateful and governed infrastructure layer is a practical foundation for agent-assisted EDA automation.

Keywords: Large Language Models, Autonomous Agents, EDA Automation, Stateful Infrastructure, Tool State Preservation, Iterative Optimization, Multi-step Analysis, Agentic Workflow

86. ❌ WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Authors: Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

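Scoring rationale: The paper studies LLM-driven computer-use agents for end-to-end automated web testing. It is highly relevant to 'Large Language Models' (10 points) because it explicitly credits LLMs with driving a paradigm shift in programming, and highly relevant to 'LLM Agents' and 'Tool Use' (10 points each) because it studies how agents control a computer through natural-language instructions to test web pages, which falls under agent tool use. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are not covered in the paper and score 0.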
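
The score table and pass mark above can be reproduced with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section. The per-keyword rule used here, score = weight × relevance, is only an assumption that happens to match the rows shown; the report does not state how the per-keyword score is derived. The function names are hypothetical.

```python
from typing import Dict, Tuple


def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def weighted_total(rows: Dict[str, Tuple[float, float]]) -> float:
    """Sum per-keyword scores; rows maps keyword -> (weight, relevance out of 10).

    Assumption: per-keyword score = weight * relevance.
    """
    return sum(weight * relevance for weight, relevance in rows.values())


# Three of the non-zero-relevance rows from the table above (weight 0.0 in each).
rows = {
    "Large Language Models OR LLMs OR Foundation Models": (0.0, 10.0),
    "LLM Agents OR Autonomous Agents OR Agentic Workflow": (0.0, 10.0),
    "Tool Use OR Function Calling OR API Tool Use": (0.0, 10.0),
}
total = weighted_total(rows)
passed = total >= pass_threshold(27.0)
```

Under this assumption, a weight of 0.0 zeroes out every row regardless of relevance, which is consistent with the 0.0 total reported for this paper.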

!!! tip deepseek-chat TL;DR

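The paper introduces the WebTestBench benchmark and the WebTester framework for evaluating LLM-driven computer-use agents on end-to-end automated web testing, and finds that current agents fall well short in test completeness, defect detection, and the reliability of long-horizon interaction.

Abstract (Chinese translation)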

大型语言模型（LLM）的出现催化了编程领域的范式转变，催生了“氛围编码”（vibe coding）——用户能够通过自然语言指令构建完整项目，甚至控制计算机。这一范式推动了自动化网页开发，但也引入了一项新需求：如何自动验证网页功能是否被可靠实现。现有方法难以适应，它们依赖静态视觉相似性或预定义检查清单，在开放环境中实用性受限。此外，这些方法忽视了软件质量的一个关键方面，即潜在逻辑约束。为填补这些空白，我们提出了WebTestBench，一个用于评估端到端自动化网页测试的基准。WebTestBench涵盖了多样化网页应用类别的综合维度。我们将测试过程分解为两个级联子任务——检查清单生成与缺陷检测，并提出了针对此任务的基线框架WebTester。通过WebTester对主流大型语言模型进行评估，揭示了严峻挑战，包括测试完整性不足、检测瓶颈以及长程交互不可靠性。这些发现暴露了当前计算机使用代理能力与工业级部署需求之间的显著差距。我们希望WebTestBench能为推进端到端自动化网页测试提供有价值的见解与指导。我们的数据集与代码公开于https://github.com/friedrichor/WebTestBench。

摘要 (Abstract)

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to “vibe coding”, where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.

Keywords: Large Language Models, LLM Agents, Computer-Use Agents, Automated Web Testing, WebTestBench, End-to-End Testing, Tool Use, Benchmark Evaluation

87. ❌ A Wireless World Model for AI-Native 6G Networks

Authors: Ziqi Chen, Yi Ren, Yixuan Huang, Qi Sun, Nan Li, Yuhong Huang, Chih-Lin I, Yifan Li, Liang Xia Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

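Scoring rationale: The paper proposes a Wireless World Model (WWM) for 6G networks, a multi-modal foundation framework whose core innovation is a multi-modal mixture-of-experts (MoE) Transformer architecture that predicts the spatiotemporal evolution of wireless channels. The paper explicitly mentions a 'foundation framework' and a 'multi-modal mixture-of-experts Transformer', so it is relevant to 'Foundation Models' (8 points in the table) and highly relevant to 'Mixture of Experts' (10 points). It describes the model as 'Pre-trained on a massive ray-traced multi-modal dataset', so it is highly relevant to 'Pre-training' (10 points). The title and abstract explicitly mention 'World Model', so it is highly relevant to 'World Models' (10 points). The paper applies AI to wireless communications (6G networks), which falls within 'AI for Science' broadly construed (8 points). Other keywords, such as LLMs, SLMs, SFT, RLHF, RAG, and Agents, mainly target natural language processing or general AI tasks and are unrelated to this paper's physical-layer wireless application, so they score 0.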

!!! tip deepseek-chat TL;DR

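The study addresses the failure of current data-driven AI methods for 6G networks to generalize across dynamic environments, which stems from their lack of an intrinsic understanding of electromagnetic wave propagation. It proposes a pre-trained Wireless World Model (WWM) that fuses multi-source information with a multi-modal mixture-of-experts Transformer, outperforms existing methods on multiple downstream tasks, and lays a foundation for physics-aware 6G intelligence.

Abstract (Chinese translation)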

将人工智能融入物理层是6G网络的基石。然而，当前数据驱动的方法因缺乏对电磁波传播的内在理解，难以在动态环境中实现泛化。我们提出了无线世界模型（Wireless World Model, WWM），这是一个多模态基础框架，通过内化三维几何与信号动态之间的因果关系，预测无线信道的时空演化。WWM基于大规模射线追踪多模态数据集进行预训练，克服了数据真实性鸿沟，并在真实世界测量数据下得到进一步验证。该模型采用联合嵌入预测架构与多模态专家混合Transformer，将信道状态信息、三维点云和用户轨迹融合为统一表征。在WWM支持的五个关键下游任务中，其在已知环境、未知泛化场景和真实世界测量中均取得了卓越性能，持续超越最先进的单模态基础模型和任务专用模型。这为适应物理世界的、具备物理感知能力的6G智能铺平了道路。

摘要 (Abstract)

Integrating AI into the physical layer is a cornerstone of 6G networks. However, current data-driven approaches struggle to generalize across dynamic environments because they lack an intrinsic understanding of electromagnetic wave propagation. We introduce the Wireless World Model (WWM), a multi-modal foundation framework predicting the spatiotemporal evolution of wireless channels by internalizing the causal relationship between 3D geometry and signal dynamics. Pre-trained on a massive ray-traced multi-modal dataset, WWM overcomes the data authenticity gap, further validated under real-world measurement data. Using a joint-embedding predictive architecture with a multi-modal mixture-of-experts Transformer, WWM fuses channel state information, 3D point clouds, and user trajectories into a unified representation. Across the five key downstream tasks supported by WWM, it achieves remarkable performance in seen environments, unseen generalization scenarios, and real-world measurements, consistently outperforming SOTA uni-modal foundation models and task-specific models. This paves the way for physics-aware 6G intelligence that adapts to the physical world.

Keywords: Wireless World Model, 6G networks, multi-modal foundation framework, mixture-of-experts Transformer, pre-training, channel prediction, AI-native, physics-aware intelligence

88. ❌ Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Authors: Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

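Scoring rationale: The paper focuses on video generation. It studies how to generate high-quality long videos with pre-trained video diffusion models and proposes a training-free, layer-adaptive framework that addresses out-of-distribution problems in frame-level relative position and context length. All scoring keywords relate to large language models, deep learning principles, or scientific applications, whereas the core of this paper is video diffusion models and attention-mechanism optimization, which has no direct connection to those keywords.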

!!! tip deepseek-chat TL;DR

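The paper proposes FreeLOC, a training-free, layer-adaptive framework that uses video-based relative position re-encoding and tiered sparse attention to address the out-of-distribution problems that arise when pre-trained video diffusion models generate long videos, improving temporal consistency and visual quality.

Abstract (Chinese translation)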

利用预训练视频扩散模型生成长视频是一项重大挑战，这类模型通常基于短视频片段训练而成。直接将这些模型应用于长视频推理往往会导致视觉质量显著下降。本文发现该问题主要源于两个分布外问题：帧级相对位置分布外问题与上下文长度分布外问题。为应对这些挑战，我们提出FreeLOC——一种无需重新训练、具备层自适应能力的新型框架，其引入两项核心技术：针对帧级相对位置分布外问题的视频相对位置重编码技术，该多粒度策略通过分层重编码时序相对位置以对齐模型预训练分布；以及针对上下文长度分布外问题的分层稀疏注意力机制，该机制通过在不同时间尺度上构建注意力密度，同时保留局部细节与长程依赖关系。关键的是，我们引入了层自适应探测机制，可识别各Transformer层对这些分布外问题的敏感度，从而实现方法的精准高效选择性应用。大量实验表明，我们的方法显著优于现有免训练方法，在时序一致性与视觉质量方面均达到最先进水平。代码发布于https://github.com/Westlake-AGI-Lab/FreeLOC。

摘要 (Abstract)

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model’s pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
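
The idea of structuring attention density across temporal scales can be pictured with a toy frame-level mask. This is not the paper's TSA implementation, which the abstract does not specify. The tier boundaries (`local`, `mid`) and strides (`mid_stride`, `far_stride`) are hypothetical parameters chosen for illustration: nearby frames attend densely, mid-range frames are subsampled, and distant frames are subsampled more coarsely.

```python
from typing import List


def tiered_mask(
    num_frames: int,
    local: int = 4,        # hypothetical: full attention within this frame distance
    mid: int = 16,         # hypothetical: upper bound of the mid-range tier
    mid_stride: int = 4,   # hypothetical: keep every 4th frame in the mid tier
    far_stride: int = 16,  # hypothetical: keep every 16th frame beyond the mid tier
) -> List[List[bool]]:
    """Return mask[q][k] = True when query frame q may attend to key frame k."""
    mask: List[List[bool]] = []
    for q in range(num_frames):
        row: List[bool] = []
        for k in range(num_frames):
            distance = abs(q - k)
            if distance <= local:
                keep = True                          # dense tier: local detail
            elif distance <= mid:
                keep = distance % mid_stride == 0    # sparser tier: mid-range context
            else:
                keep = distance % far_stride == 0    # sparsest tier: long-range links
            row.append(keep)
        mask.append(row)
    return mask
```

With these defaults, each query keeps all keys within 4 frames, every fourth key out to 16 frames, and every sixteenth key beyond that, so the number of attended keys grows much more slowly than the video length.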

Keywords: long video generation, video diffusion models, out-of-distribution correction, layer-adaptive framework, tiered sparse attention, training-free method, temporal consistency, visual quality

89. ❌ The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering

Authors: Umair Siddique Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

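Scoring rationale: The paper studies AI assistants in safety engineering, which is an AI application area. However, every keyword targets the technical principles, training methods, or optimization techniques of large models and deep learning, or specific applications such as AI for Science, whereas this paper discusses a collaboration framework for general AI assistants in safety engineering. It does not involve any specific large-model technique, training method, inference optimization, or AI-for-Science application, so the relevance of every keyword is 0.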

!!! tip deepseek-chat TL;DR

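The paper studies the "competence shadow" that AI assistants can introduce into safety engineering for Physical AI systems, builds a formal framework, derives performance bounds, and finds that the effect of AI assistance depends on collaboration design rather than on the tool itself.

Abstract (Chinese translation)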

随着人工智能助手日益融入物理人工智能系统的安全工程工作流程，一个关键问题随之浮现：AI辅助究竟提升了安全分析质量，还是引入了系统性盲点，而这些盲点仅在部署后的事故中才会显现？本文构建了一个用于安全分析中AI辅助的形式化框架。我们首先阐明了安全工程为何难以通过基准测试进行评估：安全能力本质上是多维度的，受限于情境依赖的正确性、固有的不完整性以及合理的专家分歧。我们通过一个五维能力框架将其形式化，该框架涵盖领域知识、标准专长、操作经验、情境理解和专业判断。
我们提出了“能力阴影”的概念：即由AI生成的安全分析所引发的人类推理系统性窄化。这种阴影并非AI所呈现的内容，而是其阻碍人们考量的部分。我们形式化了四种典型的人机协作结构，并推导出闭式性能边界，证明能力阴影会以乘性效应叠加，导致分析质量退化程度远超简单的加性估计。
核心研究发现是：安全工程中的AI辅助本质上是一个协作设计问题，而非软件采购决策。同一工具会提升还是降低分析质量，完全取决于其使用方式。我们推导出了抗阴影工作流程的非退化条件，并呼吁为实现可信的物理人工智能，应将关注点从工具认证转向工作流程认证。

摘要 (Abstract)

As AI assistants become integrated into safety engineering workflows for Physical AI systems, a critical question emerges: does AI assistance improve safety analysis quality, or introduce systematic blind spots that surface only through post-deployment incidents? This paper develops a formal framework for AI assistance in safety analysis. We first establish why safety engineering resists benchmark-driven evaluation: safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement. We formalize this through a five-dimensional competence framework capturing domain knowledge, standards expertise, operational experience, contextual understanding, and judgment. We introduce the competence shadow: the systematic narrowing of human reasoning induced by AI-generated safety analysis. The shadow is not what the AI presents, but what it prevents from being considered. We formalize four canonical human-AI collaboration structures and derive closed-form performance bounds, demonstrating that the competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. The central finding is that AI assistance in safety engineering is a collaboration design problem, not a software procurement decision. The same tool degrades or improves analysis quality depending entirely on how it is used. We derive non-degradation conditions for shadow-resistant workflows and call for a shift from tool qualification toward workflow qualification for trustworthy Physical AI.

Keywords: AI assistance, safety engineering, competence shadow, human-AI collaboration, Physical AI, safety analysis, workflow qualification, performance bounds

90. ❌ A Decade-Scale Benchmark Evaluating LLMs’ Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Authors: Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25196v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

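Scoring rationale: The paper's core is an evaluation of LLMs applied to healthcare. It is highly relevant to 'Large Language Models' (10 points) and counts as an application of 'AI for Science' in bioinformatics and medicine (10 points). It does not involve the other keywords on technical innovation, such as MoE, SFT, RAG, or inference optimization, so those score 0.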

!!! tip deepseek-chat TL;DR

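The study builds the CPGBench benchmark to evaluate how well 8 leading large language models detect and adhere to clinical practice guidelines in multi-turn conversations, and finds a substantial gap between guideline detection (71.1%-89.6%) and adherence in practice (21.8%-63.2%).

Abstract (Chinese translation)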

临床实践指南（Clinical Practice Guidelines, CPGs）在确保循证决策和改善患者预后方面发挥着关键作用。尽管大型语言模型（Large Language Models, LLMs）在医疗健康场景中的应用日益广泛，但目前尚不清楚LLMs在对话中识别并遵循CPGs的能力达到何种程度。为填补这一空白，我们提出了CPGBench——一个自动化评估框架，用于在多轮对话中基准测试LLMs的临床指南检测与遵循能力。我们收集了过去十年间来自9个国家/地区和2个国际组织发布的3,418份CPG文件，涵盖24个临床专科。从这些文件中，我们提取了32,155条临床推荐意见，并附有相应的发布机构、日期、国家、专科、推荐强度、证据等级等信息。针对每条推荐意见，我们相应生成了一段多轮对话，用以评估8个主流LLMs的检测与遵循能力。研究发现，71.1%-89.6%的推荐意见能被正确检测到，但仅有3.6%-29.7%的对应指南标题能被正确引用，这揭示了模型知晓指南内容与明确其来源之间存在差距。不同模型的指南遵循率在21.8%至63.2%之间，表明知晓指南内容与能够应用指南之间存在巨大鸿沟。为确认自动分析的有效性，我们进一步开展了涉及56位不同专科临床医生的全面人工评估。据我们所知，CPGBench是首个系统揭示LLMs在对话中未能检测或遵循哪些临床推荐意见的基准测试。鉴于每条临床推荐意见都可能影响大量人群，且临床应用本质上是安全关键领域，解决这些差距对于LLMs在现实临床实践中安全、负责任地部署至关重要。

摘要 (Abstract)

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.

Keywords: Large Language Models, Clinical Practice Guidelines, Healthcare, Benchmark, Multi-turn Conversations, Clinical Recommendations, Adherence Evaluation, AI for Science

91. ❌ Probing the Lack of Stable Internal Beliefs in LLMs

Authors: Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, Andrew Chi-Chih Yao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25187v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

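Scoring rationale: The paper directly studies the limitations of LLMs in maintaining internal consistency, which is research on the technical principles of large models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not involve the specific techniques (such as MoE, quantization, or inference acceleration), training methods (such as pre-training, fine-tuning, or alignment), application scenarios (such as AI for Science or agents), or specific capabilities (such as long context or tool use) covered by the other keywords, so they all score 0.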

!!! tip deepseek-chat TL;DR

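The study finds that large language models lack stable internal beliefs when simulating human personality traits and struggle to maintain implicit consistency across multi-turn interactions, which limits the ability to build realistic persona-driven dialogue systems.

Abstract (Chinese translation)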

角色驱动的大语言模型（LLMs）需要在交互中保持行为倾向的一致性，以模拟人类的人格特质，例如坚持性或可靠性。然而，当前的大语言模型往往缺乏稳定的内部表征，难以在长程对话中锚定其回应。本研究探讨了大语言模型是否能够维持“隐性一致性”，即定义为在多轮交互中持续遵循一个未明说目标的能力。我们设计了一种“20问”风格的谜语游戏范式，要求大语言模型秘密选定一个目标，并以“是/否”回答回应用户的猜测。通过评估，我们发现大语言模型难以保持潜在的一致性：除非在上下文中明确提供其选定的目标，否则其隐性的“目标”会在多轮对话中发生偏移。这些发现揭示了构建角色驱动大语言模型的关键局限，并强调需要开发能够随时间锚定隐性目标的机制，这对于在对话系统等交互应用中实现逼真的人格建模至关重要。

摘要 (Abstract)

Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain “implicit consistency”, defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users’ guesses with “yes/no” answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit “goals” shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.
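
One way to picture the implicit-consistency check in a 20-questions game: if a hidden target really were fixed, at least one candidate would stay compatible with every yes/no answer given so far. The sketch below is an illustration under strong assumptions, not the paper's evaluation protocol. It assumes a small closed candidate set with known attributes, and `ATTRIBUTES`, `has`, and `first_inconsistent_turn` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Predicate = Callable[[str], bool]

# Hypothetical closed world: each candidate target and its attributes.
ATTRIBUTES: Dict[str, Set[str]] = {
    "cat": {"mammal", "legs"},
    "eagle": {"flies", "legs"},
    "salmon": {"swims"},
}


def has(attribute: str) -> Predicate:
    """Build the question 'does the target have this attribute?'."""
    return lambda candidate: attribute in ATTRIBUTES[candidate]


def first_inconsistent_turn(
    candidates: List[str],
    transcript: List[Tuple[Predicate, bool]],
) -> Optional[int]:
    """Return the first turn after which no candidate fits all answers, else None."""
    remaining = list(candidates)
    for turn, (question, answer) in enumerate(transcript):
        # Keep only candidates for which the model's answer would be truthful.
        remaining = [c for c in remaining if question(c) == answer]
        if not remaining:
            return turn
    return None
```

A transcript that answers "yes" to both "is it a mammal?" and "does it fly?" leaves no candidate standing in this toy world, which would flag a drifting implicit target at the second turn.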

Keywords: Large Language Models, persona-driven, internal consistency, implicit consistency, multi-turn interactions, dialogue systems, personality modeling, behavioral tendencies

92. ❌ Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

Authors: Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Li Qing, Ke Tang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

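Scoring rationale: The paper's core is the use of RL in LLM post-training, specifically efficient training methods for reasoning tasks. Highly relevant keywords include LLMs (the paper explicitly studies LLM training), Post-training/SFT (it studies RL as a post-training method), RLHF/DPO (it studies RL training algorithms), and Chain of Thought (it evaluates on math reasoning benchmarks, which involve multi-step reasoning). Other keywords, such as MoE, SLMs, Scaling Laws, and PEFT, are not covered in the paper and score 0.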

!!! tip deepseek-chat TL;DR

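To address the high computational cost of reinforcement learning for large language models on reasoning tasks, the paper proposes HIVE, a dual-stage prompt selection framework based on historical reward trajectories and real-time entropy evaluation, which significantly improves training efficiency without compromising performance.

Abstract (Chinese translation)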

强化学习已成为大型语言模型在推理任务中进行后训练的关键技术。虽然增加推演规模可以稳定训练并提升性能，但计算开销是一个关键问题。在GRPO等算法中，每个提示词进行多次推演会产生极高的成本，因为大量提示词提供的梯度可忽略不计，因而效用较低。为解决这一问题，我们研究了如何在推演阶段前筛选高效用提示词。实验分析表明，样本效用呈现非均匀且动态变化的特性：最强的学习信号集中在“学习前沿”——即中等难度与高不确定性的交汇区域，该区域会随训练进程发生迁移。基于此发现，我们提出了HIVE（基于历史信息与在线验证的提示词筛选框架），这是一个面向数据高效强化学习的双阶段框架。HIVE利用历史奖励轨迹进行粗筛选，并采用提示词熵作为实时代理指标来剔除效用衰减的实例。通过在多个数学推理基准和模型上评估HIVE，我们证明该框架能在不影响性能的前提下显著提升推演效率。

摘要 (Abstract)

Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the “learning edge”, the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.
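
The two observations in the abstract can be sketched in a few lines: a prompt whose rollouts all receive the same reward yields zero group-normalized advantage, and selection can proceed in two stages. This is a loose illustration, not HIVE itself. The thresholds (`low`, `high`, `min_entropy`), the per-prompt entropy values, and the function names are hypothetical, and the abstract does not say how prompt entropy is computed.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def coarse_select(
    history: Dict[str, List[float]],
    low: float = 0.2,    # hypothetical lower bound on historical mean reward
    high: float = 0.8,   # hypothetical upper bound on historical mean reward
) -> List[str]:
    """Stage 1: keep prompts whose past mean reward suggests intermediate difficulty."""
    return [p for p, rewards in history.items() if rewards and low <= mean(rewards) <= high]


def online_verify(
    candidates: List[str],
    entropy: Dict[str, float],
    min_entropy: float = 0.5,  # hypothetical uncertainty threshold
) -> List[str]:
    """Stage 2: drop candidates whose current entropy signals stale utility."""
    return [p for p in candidates if entropy[p] >= min_entropy]
```

For example, `group_advantages([1.0, 1.0, 1.0, 1.0])` is all zeros, so rollouts spent on that prompt contribute no gradient. The second stage matters because a prompt that looked intermediate in past epochs may no longer sit at the learning edge.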

Keywords: Reinforcement Learning, Large Language Models, Post-training, Reasoning Tasks, Prompt Selection, Training Efficiency, Math Reasoning, HIVE Framework

93. ❌ Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

Authors: Shiji Zhao, Shukun Xiong, Maoxun Yuan, Yao Huang, Ranjie Duan, Qing Guo, Jiansheng Chen, Haibin Duan, Xingxing Wei Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25170v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

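Scoring rationale: This paper focuses on adversarial training for infrared object detection, embedding physical knowledge through thermal radiation modeling to improve model robustness. All scoring keywords concern large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, quantization), or bioinformatics/cheminformatics within AI for Science. The paper's topic is infrared image processing and adversarial defense in computer vision; it does not involve any LLM technique, large-model application, or the specific AI for Science subfields (e.g., bioinformatics/cheminformatics). All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes Knowledge-Guided Adversarial Training (KGAT), which embeds infrared physical knowledge into adversarial training via thermal radiation modeling and improves both the clean accuracy of infrared object detectors and their robustness against adversarial attacks and common corruptions.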

Abstract

In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.

Keywords: infrared object detection, adversarial training, thermal radiation modeling, robustness, knowledge-guided, adversarial examples, common corruptions, physical knowledge
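
To make the idea of a rank-order relation between class gray values concrete, here is a small sketch. It is an assumed, simplified form and not KGAT itself: the hinge-style penalty, the function names, and the prior that "person" regions are brighter than "car" regions are hypothetical choices for illustration only.

```python
import numpy as np

def mean_gray(image, box):
    """Mean gray value inside a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return float(image[y0:y1, x0:x1].mean())

def rank_violation(image, boxes, labels, expected_order, margin=0.0):
    """Sum of hinge penalties over box pairs whose gray-value order
    contradicts expected_order (class names listed brightest first)."""
    rank = {name: i for i, name in enumerate(expected_order)}
    grays = [mean_gray(image, b) for b in boxes]
    penalty = 0.0
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            # Class of box i is expected to be brighter than class of box j.
            if rank[labels[i]] < rank[labels[j]]:
                penalty += max(0.0, margin + grays[j] - grays[i])
    return penalty

# Toy infrared image: one warm region and one cooler region.
image = np.zeros((8, 8))
image[0:4, 0:4] = 200.0
image[4:8, 4:8] = 50.0
boxes = [(0, 0, 4, 4), (4, 4, 8, 8)]
prior = ["person", "car"]  # hypothetical prior: person brighter than car

consistent = rank_violation(image, boxes, ["person", "car"], prior)
swapped = rank_violation(image, boxes, ["car", "person"], prior)
print(consistent, swapped)
```

A prediction that respects the assumed thermal ordering incurs no penalty, while swapping the two labels is penalized by the gray-value gap between the regions.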

94. ❌ PIDP-Attack: Combining Prompt Injection with Database Poisoning Attacks on Retrieval-Augmented Generation Systems

Authors: Haozhen Wang, Haoyue Liu, Jionghao Zhu, Zhichao Wang, Yongxin Guo, Xiaoying Tang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

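Scoring rationale: The paper's core topic is security vulnerabilities in RAG systems; it proposes a compound attack that combines prompt injection with database poisoning. It is highly relevant to "Large Language Models" and "Retrieval-Augmented Generation" (10 points each) because it focuses on LLM+RAG systems. It is somewhat relevant to "Hallucination Mitigation" (5 points): RAG aims to mitigate hallucination, but the paper studies attacks rather than mitigation. Other keywords such as MoE, SLMs, training methods, inference optimization, and AI for Science are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes PIDP-Attack, a compound attack combining prompt injection and database poisoning that effectively manipulates LLM responses in RAG systems. Across three benchmark datasets and eight LLMs, it raises attack success rates by 4% to 16% over the original PoisonedRAG.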

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications. However, their practical deployment is often hindered by issues such as outdated knowledge and the tendency to generate hallucinations. To address these limitations, Retrieval-Augmented Generation (RAG) systems have been introduced, enhancing LLMs with external, up-to-date knowledge sources. Despite their advantages, RAG systems remain vulnerable to adversarial attacks, with data poisoning emerging as a prominent threat. Existing poisoning-based attacks typically require prior knowledge of the user’s specific queries, limiting their flexibility and real-world applicability. In this work, we propose PIDP-Attack, a novel compound attack that integrates prompt injection with database poisoning in RAG. By appending malicious characters to queries at inference time and injecting a limited number of poisoned passages into the retrieval database, our method can effectively manipulate LLM response to arbitrary query without prior knowledge of the user’s actual query. Experimental evaluations across three benchmark datasets (Natural Questions, HotpotQA, MS-MARCO) and eight LLMs demonstrate that PIDP-Attack consistently outperforms the original PoisonedRAG. Specifically, our method improves attack success rates by 4% to 16% on open-domain QA tasks while maintaining high retrieval precision, proving that the compound attack strategy is both necessary and highly effective.

Keywords: Retrieval-Augmented Generation, RAG, Large Language Models, LLMs, Prompt Injection, Database Poisoning, Adversarial Attacks, Attack Success Rate

95. ❌ Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Authors: Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, Guanjun Jiang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

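Scoring rationale: The paper's core topic is a skill distillation framework for LLM agents. It is highly relevant to "LLM Agents" and "Multi-agent Systems" (10 points each), as it involves agent coordination and sub-agent analysis. It is highly relevant to "Large Language Models" (10 points), since it explicitly uses LLM agents and tests models of different scales. It is somewhat relevant to "Chain of Thought" and "System 2 Thinking" (5 points each), as it involves reasoning and hierarchical consolidation, and to "Self-Improvement" and "Tool Use" (5 points each), as it involves skill improvement and tool use (e.g., spreadsheets). Other keywords such as MoE, quantization, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Trace2Skill framework, which uses parallel sub-agents to analyze execution trajectories and hierarchically consolidates their lessons, distilling complex agent experience into transferable declarative skills. It significantly improves performance on spreadsheet, visual question answering, and math reasoning tasks, and the skills transfer across LLM scales.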

Abstract

Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic’s official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills – requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.

Keywords: LLM agents, skill distillation, transferable skills, multi-agent systems, trajectory analysis, inductive reasoning, agent coordination, declarative skills
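
The "analyze in parallel, then consolidate hierarchically" pattern from the abstract can be sketched as below. This is only a structural illustration under stated assumptions: `analyze_trajectory` and `merge_lessons` are hypothetical stand-ins for LLM sub-agent calls, the trajectory format is invented for the example, and de-duplication stands in for the paper's inductive reasoning step.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trajectory(trajectory):
    """Stand-in for one sub-agent: emit a lesson for every failed step."""
    return [f"avoid: {step['action']}" for step in trajectory if not step["ok"]]

def merge_lessons(left, right):
    """Stand-in for consolidation: order-preserving de-duplication."""
    merged = list(left)
    for lesson in right:
        if lesson not in merged:
            merged.append(lesson)
    return merged

def consolidate(lesson_sets):
    """Merge lesson sets pairwise, level by level, until one set remains."""
    if not lesson_sets:
        return []
    level = lesson_sets
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(merge_lessons(pair[0], pair[1]))
            else:
                next_level.append(pair[0])  # odd one out moves up unchanged
        level = next_level
    return level[0]

# Hypothetical execution trajectories from a spreadsheet task.
trajectories = [
    [{"action": "overwrite header row", "ok": False},
     {"action": "save file", "ok": True}],
    [{"action": "overwrite header row", "ok": False}],
    [{"action": "hard-code cell range", "ok": False}],
]

# Analyze all trajectories in parallel; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_trajectory = list(pool.map(analyze_trajectory, trajectories))

skill = consolidate(per_trajectory)
print(skill)
```

The tree-shaped merge keeps each consolidation step small, which is one plausible reason to prefer it over folding trajectories in one at a time.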

96. ❌ Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Authors: Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

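Scoring rationale: The paper's core topic is the application of multimodal large language models to 3D medical image understanding. It is highly relevant to "Large Language Models" (10 points) as an application of large models in a scientific domain. The proposed Photon framework uses adaptive token scheduling and gradient propagation to reduce computational cost and accelerate training and inference, which is highly relevant to "Speculative Decoding OR Inference Acceleration" (10 points). The paper focuses on medical visual question answering, which falls under "AI for Science" (10 points). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the high computational cost of multimodal large language models in 3D medical image understanding, the paper proposes the Photon framework, which uses adaptive token scheduling and surrogate gradient propagation to reach state-of-the-art accuracy while reducing resource usage.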

Abstract

Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

Keywords: Multimodal Large Language Models, 3D Medical Volumes, Token Compression, Instruction-conditioned Token Scheduling, Surrogate Gradient Propagation, Medical Visual Question Answering, Computational Efficiency, Training and Inference Acceleration
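
As a rough illustration of instruction-conditioned, variable-length token selection, the sketch below keeps only the visual tokens that are similar enough to an instruction embedding. Everything here is an assumption for the example: the cosine-similarity scoring, the fixed threshold, and the function name are hypothetical, and the hard mask is not differentiable, so the surrogate gradient rule mentioned in the abstract is not reproduced.

```python
import numpy as np

def select_tokens(visual_tokens, instruction_vec, threshold=0.5):
    """Keep tokens whose cosine similarity to the instruction embedding
    exceeds the threshold; the number kept varies with the input."""
    tok_norm = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    ins_norm = instruction_vec / np.linalg.norm(instruction_vec)
    scores = tok_norm @ ins_norm
    keep = scores > threshold
    return visual_tokens[keep], keep

# Four toy visual tokens and one toy instruction embedding.
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
instruction = np.array([1.0, 0.0])

kept, mask = select_tokens(tokens, instruction)
print(mask, kept.shape)
```

Because the threshold is applied per token, a different instruction can keep a different number of tokens from the same volume, which is the variable-length behavior the abstract contrasts with fixed-length compression.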

97. ❌ Vision Hopfield Memory Networks

Authors: Jianfeng Wang, Amine M’Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25157v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

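Scoring rationale: The paper proposes V-HMN, a brain-inspired vision foundation model. It is mainly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), since it is designed as a vision foundation model and may extend to multimodal settings. It is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points), since the paper explicitly emphasizes improving interpretability through memory retrieval. Other keywords such as MoE, SLMs, training methods, reasoning techniques, agent systems, and scientific AI applications are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

Existing vision foundation models (e.g., Transformers, Mamba) follow computational principles unlike those of the brain, are data-inefficient, and offer limited interpretability. To address this, the paper proposes the brain-inspired Vision Hopfield Memory Network (V-HMN). With hierarchical memory mechanisms and iterative refinement updates, it achieves competitive results on computer vision benchmarks while improving interpretability, data efficiency, and biological plausibility.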

Abstract

Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.

Keywords: Vision Hopfield Memory Network, brain-inspired foundation backbone, hierarchical memory mechanisms, interpretability, data efficiency, vision foundation model, associative memory, predictive-coding
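
For readers unfamiliar with associative memory retrieval, the sketch below shows the softmax-based update commonly used in modern Hopfield networks, where a query is repeatedly replaced by an attention-weighted mix of stored patterns. It is a generic illustration and not the V-HMN modules themselves: the stored patterns, the inverse temperature `beta`, and the step count are assumptions chosen for the example.

```python
import numpy as np

def hopfield_retrieve(query, patterns, beta=4.0, steps=3):
    """Iterate xi <- patterns^T softmax(beta * patterns @ xi).

    patterns has one stored pattern per row. Returns the retrieved
    state and the final attention weights over stored patterns."""
    xi = query.astype(float)
    weights = None
    for _ in range(steps):
        logits = beta * (patterns @ xi)
        weights = np.exp(logits - logits.max())  # numerically stable softmax
        weights /= weights.sum()
        xi = patterns.T @ weights
    return xi, weights

# Three stored patterns and a noisy version of the first one.
patterns = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
noisy = np.array([0.9, 0.2, 0.1])

retrieved, attention = hopfield_retrieve(noisy, patterns)
print(int(np.argmax(attention)), np.round(attention, 3))
```

The attention weights show which stored pattern the query was matched to, which is the kind of input-to-memory relationship the abstract points to as a source of interpretability.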

98. ❌ UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering, and Dual-Channel Fusion for Robust Multi-Hop Reasoning

Authors: Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

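Scoring rationale: UniAI-GraphRAG focuses on improving RAG systems, specifically the GraphRAG framework, to address challenges in complex reasoning, multi-hop queries, and domain-specific QA. It is therefore highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (15 points), as this is the paper's core topic. The paper uses LLMs for knowledge extraction, which makes it relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It involves multi-hop reasoning, which is somewhat relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (5 points each), though this is not the core focus. Other keywords such as MoE, SLMs, training techniques, alignment, agents, and compression are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the UniAI-GraphRAG framework, which combines ontology-guided knowledge extraction, multi-dimensional community clustering, and dual-channel graph retrieval fusion to significantly improve RAG performance on complex reasoning and multi-hop queries. It outperforms mainstream open-source solutions on the MultiHopRAG benchmark.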

Abstract

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in cross-industry adaptability, community report integrity, and retrieval performance. This paper proposes UniAI-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on the MultiHopRAG benchmark show that UniAI-GraphRAG outperforms mainstream open-source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly in inference and temporal queries. The code is available at https://github.com/UnicomAI/wanwu/tree/main/rag/rag_open_source/rag_core/graph.

Keywords: Retrieval-Augmented Generation, GraphRAG, multi-hop reasoning, ontology-guided extraction, community clustering, graph retrieval, domain-specific QA, benchmark evaluation
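
One simple way to combine results from two retrieval channels is reciprocal rank fusion (RRF), sketched below. This is a swapped-in, generic technique used only to illustrate what fusing a graph channel and a community channel could look like; the abstract does not say how UniAI-GraphRAG fuses its channels, and the result identifiers and the constant `k=60` are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    # Highest fused score first; ties are broken by identifier.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two channels.
graph_hits = ["e1", "e2", "e3"]      # entity/relation graph retrieval
community_hits = ["e2", "c7", "e1"]  # community report retrieval

fused = reciprocal_rank_fusion([graph_hits, community_hits])
print(fused)
```

Items returned by both channels (`e2`, `e1`) rise to the top, while items seen by only one channel are kept but ranked lower.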

99. ❌ Goodness-of-pronunciation without phoneme time alignment

Authors: Jeremy H. M. Wong, Nancy F. Chen Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25150v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

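Scoring rationale: This paper focuses on speech evaluation, specifically on improving feature extraction from automatic speech recognition (ASR) models for low-resource languages. Its core contribution is a feature extraction method that requires no phoneme time alignment, combining phoneme posteriors, word-level speaking rate and duration features, and a cross-attention architecture. All scoring keywords concern large models, innovations in deep learning techniques, or applications of AI in science, whereas this paper's subject matter (speech evaluation, ASR, low-resource language processing) has no direct connection to the keyword topics (e.g., LLMs, MoE, Scaling Laws, Alignment, RAG, Agents). The paper does not involve large-model techniques, deep learning innovations, or AI applications in scientific fields such as bioinformatics, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a feature extraction method for speech evaluation that requires no phoneme time alignment. By combining phoneme posteriors, word-level features, and a cross-attention architecture, it performs comparably with standard frame-synchronous features on English and low-resource Tamil datasets.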

Abstract

In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.

Keywords: speech evaluation, automatic speech recognition, low-resource languages, phoneme posteriors, cross-attention architecture, feature extraction, weakly-supervised models, time alignment
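
The sketch below illustrates why cross-attention can remove the need for phoneme time alignment: each phoneme-level feature attends over all frame-level features, so no explicit start or end times are required. It is a bare scaled dot-product attention with no learned projections; the feature dimensions and random inputs are assumptions for the example and do not reflect the paper's actual architecture.

```python
import numpy as np

def cross_attention(phoneme_feats, frame_feats):
    """Each phoneme (query) attends over all frames (keys and values).

    Returns one pooled frame-level vector per phoneme and the attention
    weights, which act as a soft alignment instead of hard boundaries."""
    dim = phoneme_feats.shape[1]
    logits = phoneme_feats @ frame_feats.T / np.sqrt(dim)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ frame_feats, weights

rng = np.random.default_rng(0)
phonemes = rng.normal(size=(3, 4))  # 3 phonemes, feature dimension 4
frames = rng.normal(size=(10, 4))   # 10 acoustic frames, same dimension

pooled, attn = cross_attention(phonemes, frames)
print(pooled.shape, attn.shape)
```

Each row of `attn` sums to one and spreads a phoneme's attention across frames, which is what lets frame-asynchronous model outputs be combined with phoneme-level features.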

100. ❌ FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Authors: Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Takahiro Ogawa, Miki Haseyama, Zhihui Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

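Scoring rationale: The paper focuses on dataset distillation, specifically FD², a dedicated framework for fine-grained datasets. Although it falls within deep learning, its content has no direct connection to any of the scoring keywords, which all center on large-model techniques, training methods, inference optimization, alignment, agent systems, and the like. The paper does not address large models, language models, MoE, scaling laws, pre-training or post-training, alignment techniques, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. All keyword relevance scores are therefore 0.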

!!! tip deepseek-chat TL;DR

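The paper proposes the FD² framework to address the drop in recognition performance that existing decoupled dataset distillation methods suffer on fine-grained datasets, where distilled samples within a class are overly similar and inter-class differences are subtle. By localizing discriminative regions and constructing fine-grained representations, it improves performance on multiple datasets.

Abstract (Chinese translation)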

数据集蒸馏（Dataset Distillation，DD）通过将大规模训练集压缩为小型合成集，降低了存储与训练成本，并在通用基准测试中展现出显著效果。解耦式数据集蒸馏进一步提升了效率，它将流程分解为预训练、样本蒸馏和软标签生成三个阶段。然而，现有的解耦方法主要依赖粗略的类别标签监督，并以近乎相同的方式优化每个类别内的样本。在细粒度数据集上，这往往导致蒸馏出的样本存在以下问题：（i）保留较大的类内差异而类间差异细微；（ii）同类样本过度相似，从而限制了局部判别性特征的提取，损害了识别性能。为解决上述问题，我们提出FD$^{2}$——一个专为细粒度数据集蒸馏设计的框架。FD$^{2}$能够定位判别性区域并构建用于蒸馏的细粒度表征。在预训练阶段，反事实注意力学习聚合判别性表征以更新类别原型；在蒸馏阶段，细粒度特征约束使每个样本与其类别原型对齐并排斥其他原型，同时相似性约束促使同类样本间的注意力分布多样化。在多个细粒度及通用数据集上的实验表明，FD$^{2}$能够与解耦式数据集蒸馏无缝结合，并在多数设定下提升性能，显示出较强的可迁移性。

Abstract

Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.

Keywords: Dataset Distillation, Fine-grained Datasets, Decoupled Pipeline, Discriminative Regions, Class Prototypes, Attention Learning, FD² Framework, Recognition Performance

101. ❌ SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Authors: Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

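Scoring rationale: The paper studies audio-visual deepfake detection, which belongs to computer vision and multimedia security rather than to innovations in large models or the underlying principles of deep learning. All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper focuses on self-supervised learning, multimodal fusion, and forgery detection, so it has no direct connection to them.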

!!! tip deepseek-chat TL;DR

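The paper proposes SAVe, a self-supervised audio-visual deepfake detection framework that generates pseudo-manipulations and models lip-speech synchronization to detect visual artifacts and cross-modal inconsistencies, achieving competitive in-domain performance and strong cross-dataset generalization.

Abstract (Chinese translation)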

多模态深度伪造视频可能呈现细微的视觉伪影与跨模态不一致性，其检测仍具挑战性，尤其当检测器主要基于精选的合成伪造数据训练时。这种对合成数据的依赖可能引入数据集与生成器偏差，从而限制模型对未知篡改的扩展性与鲁棒性。我们提出SAVe，一种完全基于真实视频进行学习的自监督视听深度伪造检测框架。SAVe通过生成即时、保持身份、区域感知的自混合伪篡改来模拟篡改伪影，使模型能够学习跨多个人脸粒度区域的互补视觉线索。为捕捉跨模态证据，SAVe还通过一个视听对齐组件对唇语-语音同步进行建模，以检测视听伪造特有的时序错位模式。在FakeAVCeleb和AV-LipSync-TIMIT数据集上的实验表明，该框架在领域内检测性能具有竞争力，并展现出强大的跨数据集泛化能力，凸显了自监督学习作为多模态深度伪造检测的可扩展范式潜力。

Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.

Keywords: audio-visual deepfake detection, self-supervised learning, visual artifacts, audio-visual misalignment, lip-speech synchronization, multimodal deepfakes, cross-modal inconsistencies, self-blended pseudo-manipulations

102. ❌ Reinforcement learning for quantum processes with memory

Authors: Josep Lumbreras, Ruo Cheng Huang, Yanglin Hu, Marco Fanizza, Mile Gu Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

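Scoring rationale: The paper studies reinforcement learning in quantum systems, which falls under AI for Science, so it is moderately related to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not address large models, the underlying principles of deep learning, or any of the specific techniques described by the other keywords (such as MoE, SFT, RAG, or quantization), so all other keywords receive 0 points.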

!!! tip deepseek-chat TL;DR

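The paper studies reinforcement learning for quantum processes with memory. It adapts an optimistic maximum-likelihood estimation algorithm, proves that the resulting sublinear cumulative-regret scaling is optimal up to polylogarithmic factors, and applies it to state-agnostic work extraction, achieving an asymptotically zero dissipation rate.

Abstract (Chinese translation)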

在强化学习中，智能体通过与环境顺序交互以最大化奖励，且仅能获得部分概率性反馈。这产生了根本性的探索-利用权衡：智能体必须通过探索来学习隐藏的动态规律，同时利用已掌握的知识来最大化其目标收益。尽管该框架在经典领域已得到广泛研究，但将其应用于量子系统时，需处理通过未知动力学演化的隐藏量子态。我们通过一个框架将此问题形式化：环境维护一个通过未知量子信道演化的隐藏量子记忆，而智能体使用量子仪器顺序进行干预。针对此设定，我们改进了一种乐观最大似然估计算法。我们将分析扩展至连续动作空间，从而能够对一般的正算子值测度进行建模。通过控制估计误差在量子信道和仪器中的传播，我们证明了所提出策略的累积遗憾在K轮次中按$\widetilde{\mathcal{O}}(\sqrt{K})$的速率增长。此外，通过将其约化为多臂量子赌博机问题，我们建立了信息论下界，证明该亚线性增长速率在忽略多对数因子的意义下是严格最优的。作为物理应用，我们考虑态不可知功提取问题。当从一系列由隐藏记忆关联的非独立同分布量子态中提取自由能时，对信源知识的任何缺失都会导致热力学耗散。在我们的框架中，数学上的累积遗憾精确量化了这种累积耗散。利用我们的自适应算法，智能体能够根据过往能量输出结果实时改进其提取协议，实现亚线性累积耗散，从而达成渐近零耗散率。

Abstract

In reinforcement learning, an agent interacts sequentially with an environment to maximize a reward, receiving only partial, probabilistic feedback. This creates a fundamental exploration-exploitation trade-off: the agent must explore to learn the hidden dynamics while exploiting this knowledge to maximize its target objective. While extensively studied classically, applying this framework to quantum systems requires dealing with hidden quantum states that evolve via unknown dynamics. We formalize this problem via a framework where the environment maintains a hidden quantum memory evolving via unknown quantum channels, and the agent intervenes sequentially using quantum instruments. For this setting, we adapt an optimistic maximum-likelihood estimation algorithm. We extend the analysis to continuous action spaces, allowing us to model general positive operator-valued measures (POVMs). By controlling the propagation of estimation errors through quantum channels and instruments, we prove that the cumulative regret of our strategy scales as $\widetilde{\mathcal{O}}(\sqrt{K})$ over $K$ episodes. Furthermore, via a reduction to the multi-armed quantum bandit problem, we establish information-theoretic lower bounds demonstrating that this sublinear scaling is strictly optimal up to polylogarithmic factors. As a physical application, we consider state-agnostic work extraction. When extracting free energy from a sequence of non-i.i.d. quantum states correlated by a hidden memory, any lack of knowledge about the source leads to thermodynamic dissipation. In our setting, the mathematical regret exactly quantifies this cumulative dissipation. Using our adaptive algorithm, the agent uses past energy outcomes to improve its extraction protocol on the fly, achieving sublinear cumulative dissipation, and, consequently, an asymptotically zero dissipation rate.

Keywords: reinforcement learning, quantum systems, quantum memory, regret analysis, work extraction, thermodynamic dissipation, adaptive algorithm, quantum channels
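
The abstract's step from a sublinear regret bound to an asymptotically zero dissipation rate can be spelled out in one line. The symbols below ($R(K)$ for cumulative regret, $r_k^\star$ and $r_k$ for the optimal and achieved per-episode reward, and the constant $C$) are notation assumed here for illustration and are not taken from the paper.

```latex
R(K) = \sum_{k=1}^{K} \bigl(r_k^\star - r_k\bigr),
\qquad
R(K) \le C \sqrt{K}\,\operatorname{polylog}(K)
\;\Longrightarrow\;
\frac{R(K)}{K} \le \frac{C\,\operatorname{polylog}(K)}{\sqrt{K}}
\xrightarrow[K \to \infty]{} 0 .
```

Since the abstract identifies cumulative regret with cumulative dissipation, the per-episode average on the right is the dissipation rate, which vanishes as the number of episodes grows.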

103. ❌ RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

Authors: Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, Yanghua Xiao Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

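Scoring rationale: The paper's core topic is the meta-evaluation of LLM judges for instruction following, which directly involves "Large Language Models" and "Instruction Tuning", so each receives 10 points. The paper discusses rubric-level evaluation and reports that explicit reasoning improves accuracy, both of which relate to the reasoning process, so "Chain of Thought" and "System 2 Thinking" each receive 5 points. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization are not addressed in the paper and receive 0 points.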

!!! tip deepseek-chat TL;DR

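The paper introduces the RubricEval benchmark for meta-evaluating the fine-grained judging ability of LLM judges on instruction-following tasks. It finds that even a widely used judge such as GPT-4o reaches only 55.97% accuracy on the Hard subset, and that rubric-level evaluation and explicit reasoning improve accuracy and, together, reduce inter-judge variance.

Abstract (Chinese translation)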

基于量规的评估已成为评估大语言模型指令遵循能力的主流范式。尽管应用广泛，这些量规级评估的可靠性仍不明确，亟需元评估。然而，先前的元评估工作主要集中于响应层面，未能评估基于量规的评估所依赖的细粒度判断准确性。为填补这一空白，我们提出了RubricEval。我们的基准具备以下特点：(1) 首个针对指令遵循的量规级元评估基准；(2) 涵盖多类别、多模型来源的多样化指令与响应；(3) 包含3,486个经过质量控制的评估实例，以及能更好区分评估者性能的简单/困难子集。实验表明，量规级判断远未得到解决：即使是在指令遵循基准中被广泛采用的评估者GPT-4o，在困难子集上也仅达到55.97%的准确率。从评估范式来看，量规级评估优于清单级评估，显式推理能提升准确性，而两者结合可降低评估者间的方差。通过我们建立的量规分类体系，我们进一步识别了常见的失败模式，并为可靠的指令遵循评估提供了可操作的见解。

Abstract

Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% on the Hard subset. Regarding the evaluation paradigm, rubric-level evaluation outperforms checklist-level, explicit reasoning improves accuracy, and both together reduce inter-judge variance. Through our established rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.

Keywords: RubricEval, LLM judges, instruction following, meta-evaluation, rubric-level evaluation, GPT-4o, explicit reasoning, evaluation benchmark

104. ❌ MCLMR: A Model-Agnostic Causal Learning Framework for Multi-Behavior Recommendation

Authors: Ranxu Zhang, Junjie Meng, Ying Sun, Ziqi Xu, Bing Yin, Hao Li, Yanyong Zhang, Chao Wang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

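Scoring rationale: The MCLMR paper targets recommender systems and proposes a model-agnostic causal learning framework for multi-behavior recommendation. Its core contributions are causal modeling and intervention, together with an adaptive aggregation module based on Mixture-of-Experts that fuses auxiliary behavior information. It is therefore highly related only to the keyword "Mixture of Experts OR MoE OR Sparse Models" (8 points), because the paper explicitly uses a Mixture-of-Experts technique. The remaining keywords are unrelated to the paper, which does not address large models, the underlying principles of deep learning, AI-for-science applications, or the other listed techniques.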

!!! tip deepseek-chat TL;DR

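The paper proposes MCLMR, a model-agnostic causal learning framework that uses causal intervention, Mixture-of-Experts-based adaptive aggregation, and bias-aware contrastive learning to address confounding effects and semantic gaps in multi-behavior recommendation, significantly improving baseline models on three real-world datasets.

Abstract (Chinese translation)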

多行为推荐通过利用多种用户交互类型（如浏览、点击、购买）来丰富偏好建模，缓解传统单行为方法中的数据稀疏性问题。然而，现有多行为推荐方法面临根本性挑战：缺乏建模用户行为习惯与物品多行为分布所引发复杂混淆效应的理论框架，难以有效聚合异构辅助行为，且无法在考虑偏差扭曲的同时跨越语义鸿沟对齐行为表征。为应对这些局限，我们提出MCLMR，一种新颖的模型无关因果学习框架，可无缝集成至多种多行为推荐架构中。MCLMR首先构建因果图以建模混淆效应，并通过干预实现无偏偏好估计。在此因果框架下，它采用基于专家混合的自适应聚合模块动态融合辅助行为信息，并设计偏差感知对比学习模块以偏差感知的方式对齐跨行为表征。在三个真实数据集上的大量实验表明，MCLMR在多种基线模型上均实现了显著的性能提升，验证了其有效性与普适性。所有数据与代码将公开提供。为便于匿名评审，我们的代码可通过以下链接获取：https://github.com/gitrxh/MCLMR。

Abstract

Multi-Behavior Recommendation (MBR) leverages multiple user interaction types (e.g., views, clicks, purchases) to enrich preference modeling and alleviate data sparsity issues in traditional single-behavior approaches. However, existing MBR methods face fundamental challenges: they lack principled frameworks to model complex confounding effects from user behavioral habits and item multi-behavior distributions, struggle with effective aggregation of heterogeneous auxiliary behaviors, and fail to align behavioral representations across semantic gaps while accounting for bias distortions. To address these limitations, we propose MCLMR, a novel model-agnostic causal learning framework that can be seamlessly integrated into various MBR architectures. MCLMR first constructs a causal graph to model confounding effects and performs interventions for unbiased preference estimation. Under this causal framework, it employs an Adaptive Aggregation module based on Mixture-of-Experts to dynamically fuse auxiliary behavior information and a Bias-aware Contrastive Learning module to align cross-behavior representations in a bias-aware manner. Extensive experiments on three real-world datasets demonstrate that MCLMR achieves significant performance improvements across various baseline models, validating its effectiveness and generality. All data and code will be made publicly available. For anonymous review, our code is available at the following link: https://github.com/gitrxh/MCLMR.

Keywords: Multi-Behavior Recommendation, Causal Learning, Model-Agnostic Framework, Mixture-of-Experts, Adaptive Aggregation, Bias-aware Contrastive Learning, Confounding Effects, Preference Modeling
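
As a rough illustration of what Mixture-of-Experts-style adaptive aggregation can look like, the sketch below fuses per-behavior embeddings with a softmax gate. It is a generic soft-gating example, not MCLMR's actual module: the function names, the linear gate, and the choice of the user embedding as gate input are all assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_aggregate(user_embedding, behavior_embeddings, gate_matrix):
    """Fuse auxiliary-behavior embeddings with a softmax gate (hypothetical design).

    user_embedding:      gate input, a vector of length d_in
    behavior_embeddings: one "expert output" per behavior, each of length d_out
    gate_matrix:         one row of length d_in per behavior (assumed linear gate)
    Returns (fused_vector, gate_weights).
    """
    # One gate logit per expert: dot product of its gate row with the gate input.
    logits = [sum(w * x for w, x in zip(row, user_embedding)) for row in gate_matrix]
    weights = softmax(logits)
    dim = len(behavior_embeddings[0])
    # Weighted sum of expert outputs, computed coordinate by coordinate.
    fused = [
        sum(w * emb[j] for w, emb in zip(weights, behavior_embeddings))
        for j in range(dim)
    ]
    return fused, weights
```

With an all-zero gate matrix the gate is uniform, so the fused vector is the plain average of the behavior embeddings; a trained gate would instead weight behaviors differently per user.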

105. ❌ When Sensing Varies with Contexts: Context-as-Transform for Tactile Few-Shot Class-Incremental Learning

Authors: Yifeng Lin, Aiping Huang, Wenxi Liu, Si Wu, Tiesong Zhao, Zheng-Jun Zha Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

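Scoring rationale: The paper studies few-shot class-incremental learning (FSCIL) for tactile sensing and proposes CaT-FSCIL to handle variation in the acquisition context. It focuses on specific machine learning methods in computer vision and tactile perception (FSCIL, context modeling, prototype calibration) and does not address large language models, innovations in the underlying principles of deep learning, or applications of large models across domains. All keywords relate directly to large language models, deep learning principles, or AI-for-science applications, whereas this paper applies conventional machine learning to a specific perception task, so every keyword has a relevance of 0.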

!!! tip deepseek-chat TL;DR

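The paper addresses the degradation of few-shot class-incremental learning caused by varying acquisition contexts in tactile sensing. It proposes Context-as-Transform FSCIL, which improves performance through inverse-transform canonicalization of the context and uncertainty-conditioned prototype calibration, and reports superior results on standard benchmarks.

Abstract (Chinese translation)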

少样本类增量学习（Few-Shot Class-Incremental Learning, FSCIL）在面对仅含少量标注样本的采集环境时尤为敏感。触觉感知是一个典型场景，其采集环境（例如多样化的设备、接触状态与交互设置）因缺乏标准化而导致性能下降。本文提出环境即变换的少样本类增量学习方法（Context-as-Transform FSCIL, CaT-FSCIL）以应对上述问题。我们将采集环境分解为结构化的低维成分与高维残差成分。前者易受触觉交互特征影响，我们将其建模为近似可逆的“环境即变换”族，并通过以伪环境一致性损失优化的逆变换归一化方法进行处理。后者主要源于平台与设备差异，可通过不确定性条件原型校准（Uncertainty-Conditioned Prototype Calibration, UCPC）来缓解，该方法能基于环境不确定性对偏差原型与决策边界进行校准。在标准基准数据集HapTex与LMT108上的综合实验验证了所提出的CaT-FSCIL方法的优越性。

Abstract

Few-Shot Class-Incremental Learning (FSCIL) can be particularly susceptible to acquisition contexts with only a few labeled samples. A typical scenario is tactile sensing, where the acquisition context (e.g., diverse devices, contact state, and interaction settings) degrades performance due to a lack of standardization. In this paper, we propose Context-as-Transform FSCIL (CaT-FSCIL) to tackle the above problem. We decompose the acquisition context into a structured low-dimensional component and a high-dimensional residual component. The former can be easily affected by tactile interaction features, which are modeled as an approximately invertible Context-as-Transform family and handled via inverse-transform canonicalization optimized with a pseudo-context consistency loss. The latter mainly arises from platform and device differences, which can be mitigated with an Uncertainty-Conditioned Prototype Calibration (UCPC) that calibrates biased prototypes and decision boundaries based on context uncertainty. Comprehensive experiments on the standard benchmarks HapTex and LMT108 have demonstrated the superiority of the proposed CaT-FSCIL.

Keywords: Few-Shot Class-Incremental Learning, Tactile Sensing, Context-as-Transform, Acquisition Context, Uncertainty-Conditioned Prototype Calibration, HapTex, LMT108, Canonicalization

106. ❌ Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Author: Jon-Paul Cacioli Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	7.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

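Scoring rationale: The paper's core topic is evaluating the metacognitive ability of LLMs. It directly involves "Large Language Models" (10 points), since the study is based on four LLMs. It relates to "Self-Correction/Self-Improvement/Self-Reflection" (8 points) because metacognition concerns a model's reflection on its own knowledge. It relates to "Hallucination Mitigation/Factuality/Truthfulness" (7 points) because assessing what a model does and does not know helps reduce hallucination. It relates to "Mechanistic Interpretability/Explainable AI" (8 points) because the metacognitive evaluation framework aims to explain the model's internal confidence mechanism. Other keywords, such as MoE, SLMs, training methods, and inference acceleration, are not addressed in the paper and receive 0 points.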

!!! tip deepseek-chat TL;DR

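The paper studies how to evaluate the metacognitive ability of large language models, that is, whether a model knows what it knows. Using a signal detection theory framework, it finds that metacognitive efficiency differs substantially across models and that conventional calibration metrics can mislead model selection.

Abstract (Chinese translation)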

对大型语言模型置信度的标准评估依赖于校准指标（如ECE、Brier分数），这些指标混淆了两种不同的能力：模型知道多少（类型1敏感性）以及模型对其所知内容的了解程度（类型2元认知敏感性）。我们引入了一种基于类型2信号检测理论的评估框架，该框架利用元d’（meta-d’）和元认知效率比（M-ratio）来分解这两种能力。通过对四个大型语言模型（Llama-3-8B-Instruct、Mistral-7B-Instruct-v0.3、Llama-3-8B-Base、Gemma-2-9B-Instruct）在224,000个事实性问答试验中的应用，我们发现：（1）即使在类型1敏感性相近的情况下，不同模型间的元认知效率差异显著——Mistral模型获得了最高的d’但M-ratio最低；（2）元认知效率具有领域特异性，不同模型在不同领域表现出最弱的能力，而这一现象在聚合指标中无法显现；（3）温度参数的调整会改变类型2判断标准，而在四个模型中的两个模型中，元d’保持稳定，这表明置信度策略与元认知能力是可分离的；（4）AUROC_2与M-ratio产生了完全颠倒的模型排名，证明这两种指标回答的是根本不同的评估问题。元d’框架揭示了哪些模型真正“知道它们不知道什么”，而哪些模型仅仅因为判断标准设置而显得校准良好——这一区分对模型选择、部署以及人机协作具有直接意义。本研究为预注册分析；代码与数据已公开。

Abstract

Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d’ and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar – Mistral achieves the highest d’ but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d’ remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d’ framework reveals which models “know what they don’t know” versus which merely appear well-calibrated due to criterion placement – a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.

Keywords: Large Language Models, metacognitive efficiency, Signal Detection Theory, confidence calibration, meta-d’, M-ratio, factual QA, model evaluation
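
To make the quantities named in the abstract more concrete, here is a minimal sketch of the textbook signal-detection definitions: Type-1 sensitivity d' from hit and false-alarm rates, a rank-based Type-2 AUROC from confidence and correctness, and the M-ratio as meta-d' divided by d'. It is not the paper's code. Estimating meta-d' itself requires a model fit that is not shown, so `m_ratio` takes it as an input, and all function names are chosen here for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)


def auroc2(confidences, correct):
    """Type-2 AUROC: chance that a correct trial gets higher confidence
    than an incorrect one, counting ties as one half."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def m_ratio(meta_d, d):
    """Metacognitive efficiency: meta-d' relative to d' (meta-d' assumed given)."""
    return meta_d / d
```

Because the M-ratio normalizes by d', it can rank models differently from AUROC_2, which is the kind of divergence the abstract reports.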

107. ❌ MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Authors: Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano, Rintaro Yanagi, Hirokatsu Kataoka, Akio Nakamura Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

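Scoring rationale: The paper focuses on data augmentation for image classification in computer vision and proposes a lightweight procedural augmentation technique based on Moiré interference. All scoring keywords concern large language models, the technical principles of deep learning, or applications of AI in science, whereas this paper studies augmentation for vision models in image classification. It has no direct connection to keywords such as text LLMs, model training techniques, inference optimization, AI agents, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MoireMix, a lightweight data augmentation method based on a Moiré interference formula, to improve the robustness of image classification models. Experiments show that it outperforms standard augmentation methods and existing external-data-free methods on multiple benchmarks.

Abstract (Chinese translation)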

数据增强是提升图像分类模型鲁棒性的关键技术。然而，当前许多方法依赖于基于扩散的合成或复杂的特征混合策略，这些方法会引入显著的计算开销或需要外部数据集。本文探索了一个不同的方向：基于解析干涉模式的过程式增强。与依赖随机噪声、特征混合或生成模型的传统增强方法不同，我们的方法利用莫尔干涉来生成覆盖广泛空间频率的结构化扰动。我们提出了一种轻量级增强方法，该方法使用封闭形式的数学公式动态地过程式生成莫尔纹理。这些模式直接在内存中以可忽略的计算成本（每幅图像0.0026秒）合成，在训练期间与训练图像混合，并立即丢弃，从而实现无需存储且不依赖外部数据的增强流程。在视觉变换器（Vision Transformers）上进行的大量实验表明，所提出的方法在包括ImageNet-C、ImageNet-R和对抗性基准在内的多个基准测试中，持续提升了模型的鲁棒性，其表现优于标准增强基线以及现有的无需外部数据的增强方法。这些结果表明，解析干涉模式为数据驱动的生成式增强方法提供了一种实用且高效的替代方案。

Abstract

Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.

Keywords: Data Augmentation, Image Classification, Moire Interference, Robustness, Vision Transformers, Procedural Generation, Computational Efficiency, Storage-free Pipeline
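To make the idea of closed-form, on-the-fly Moiré textures concrete, the sketch below multiplies two sinusoidal gratings with slightly different frequencies and orientations and blends the result into a grayscale image. The grating formula, the parameter names (`f1`, `f2`, `theta1`, `theta2`, `alpha`), and the linear blend are illustrative assumptions; the abstract does not give the paper's exact formulation.

```python
import math


def moire_texture(height, width, f1=0.30, f2=0.33, theta1=0.0, theta2=0.1):
    """Return a height x width grid with values in [0, 1].

    Hypothetical formulation: the product of two sinusoidal gratings whose
    frequencies and angles differ slightly, which produces low-frequency
    beat fringes on top of the fine gratings.
    """
    texture = []
    for y in range(height):
        row = []
        for x in range(width):
            # Project the pixel position onto each grating's direction.
            u1 = x * math.cos(theta1) + y * math.sin(theta1)
            u2 = x * math.cos(theta2) + y * math.sin(theta2)
            g1 = 0.5 * (1.0 + math.cos(2.0 * math.pi * f1 * u1))
            g2 = 0.5 * (1.0 + math.cos(2.0 * math.pi * f2 * u2))
            row.append(g1 * g2)
        texture.append(row)
    return texture


def mix(image, texture, alpha=0.2):
    """Linearly blend a grayscale image (values in [0, 1]) with the texture."""
    return [[(1.0 - alpha) * p + alpha * t for p, t in zip(img_row, tex_row)]
            for img_row, tex_row in zip(image, texture)]


# The texture is generated in memory, mixed in, and can then be discarded.
image = [[0.5] * 8 for _ in range(4)]
augmented = mix(image, moire_texture(4, 8))
```

Nothing is written to disk in this sketch, which mirrors the storage-free pipeline described in the abstract.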

108. ❌ Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning

Authors: Diyar Altinses, Andreas Schwung Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25103v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

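Scoring rationale: The paper studies a fault-tolerant framework for multimodal representation learning, using techniques such as convolutional autoencoders and Lipschitz modulation, and focuses on anomaly detection and reconstruction under sensor faults. All keywords concern large language models, the technical principles of deep learning, or AI-for-science applications, whereas this paper involves no large language models, no innovation in those technical principles, and no application in a specific scientific field, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a fault-tolerant multimodal representation learning framework based on Lipschitz modulation. Through theoretical analysis and two-stage self-supervised training, it improves anomaly detection accuracy and reconstruction quality under sensor faults.

Abstract (Chinese translation)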

部署于工业与安全关键环境中的现代多模态系统必须在部分传感器故障、信号退化或跨模态不一致的情况下保持可靠性。本研究提出一种基于数学理论的容错多模态表征学习框架，将自监督异常检测与误差校正统一于单一架构中。基于对扰动传播的理论分析，我们推导出基于Lipschitz连续性和雅可比矩阵的判定准则，用以判断神经算子对局部故障的放大或衰减特性。在此理论指导下，我们提出两阶段自监督训练方案：首先在洁净数据上预训练多模态卷积自编码器以在潜在空间中保留局部异常信号，随后通过可学习计算块（由用于校正的稠密层和用于异常识别的对比学习目标构成）扩展该架构。此外，我们引入层级特异性Lipschitz调制与梯度裁剪作为理论驱动的机制，以控制检测模块与校正模块间的敏感性。在多模态故障数据集上的实验结果表明，所提方法在传感器损坏情况下同时提升了异常检测精度与数据重建质量。总体而言，该框架弥合了理论鲁棒性保证与实际容错多模态学习之间的鸿沟。

Abstract

Modern multimodal systems deployed in industrial and safety-critical environments must remain reliable under partial sensor failures, signal degradation, or cross-modal inconsistencies. This work introduces a mathematically grounded framework for fault-tolerant multimodal representation learning that unifies self-supervised anomaly detection and error correction within a single architecture. Building upon a theoretical analysis of perturbation propagation, we derive Lipschitz- and Jacobian-based criteria that determine whether a neural operator amplifies or attenuates localized faults. Guided by this theory, we propose a two-stage self-supervised training scheme: pre-training a multimodal convolutional autoencoder on clean data to preserve localized anomaly signals in the latent space, and expanding it with a learnable compute block composed of dense layers for correction and contrastive objectives for anomaly identification. Furthermore, we introduce layer-specific Lipschitz modulation and gradient clipping as principled mechanisms to control sensitivity across detection and correction modules. Experimental results on multimodal fault datasets demonstrate that the proposed approach improves both anomaly detection accuracy and reconstruction under sensor corruption. Overall, this framework bridges the gap between analytical robustness guarantees and practical fault-tolerant multimodal learning.

Keywords: fault-tolerant multimodal learning, Lipschitz modulation, self-supervised anomaly detection, multimodal convolutional autoencoder, sensor corruption, perturbation propagation, gradient clipping, representation learning
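The amplify-or-attenuate criterion mentioned in the abstract can be illustrated for the simplest case. For a linear layer x ↦ Wx, the ℓ2 Lipschitz constant is the spectral norm of W, so a perturbation can grow by at most that factor. The sketch below estimates the spectral norm by power iteration and rescales a weight matrix to a target bound. Rescaling is one assumed way of modulating a layer's Lipschitz constant; the abstract does not say how the paper does it, and `rescale_to_lipschitz` is a hypothetical helper.

```python
import math


def spectral_norm(W, iters=200):
    """Largest singular value of W (a list of rows), by power iteration on W^T W."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0] * n_cols  # start vector; assumed not orthogonal to the top direction
    for _ in range(iters):
        Wv = [sum(w * x for w, x in zip(row, v)) for row in W]
        WtWv = [sum(W[i][j] * Wv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in WtWv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in WtWv]
    Wv = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in Wv))


def rescale_to_lipschitz(W, target):
    """Shrink W so its spectral norm is at most `target`; never enlarge it."""
    s = spectral_norm(W)
    scale = min(1.0, target / s) if s > 0.0 else 1.0
    return [[w * scale for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]
W_bounded = rescale_to_lipschitz(W, target=1.0)
```

A layer whose spectral norm exceeds 1 can amplify a localized fault, while one below 1 attenuates it. For a stack of layers with 1-Lipschitz activations, the product of the per-layer norms gives an upper bound on the overall constant.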

Authors: Anbang Ruan Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

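Scoring rationale: The paper mainly studies governance architecture and institutional design for multi-agent systems, proposing the AE4E paradigm to address the "Logic Monopoly" problem that arises when agents in existing frameworks plan, execute, and evaluate their own actions at the same time. The paper is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10), because its core is a governance framework for autonomous agents as business entities, and highly relevant to "Multi-agent Systems OR Agent Coordination" (10), because it focuses on coordination, governance, and institutional infrastructure for multi-agent systems. Other keywords, such as large-model technical principles, training methods, reasoning techniques, and scientific applications, are not addressed and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the "Logic Monopoly" problem in existing multi-agent frameworks, where agents plan, execute, and evaluate their own actions at the same time, the paper proposes the Agent Enterprise for Enterprise (AE4E) paradigm, based on separation of power and a social contract. It uses the NetX Enterprise Framework (NEF) to implement institutional governance of agents as autonomous business entities.

Abstract (Chinese translation)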

现有多智能体框架允许每个智能体同时规划、执行并评估自身行为——这一结构性缺陷我们称之为“逻辑垄断”。实证数据量化了由此产生的“可靠性鸿沟”：在十种部署场景中平均攻击成功率高达84.30%，31.4%的智能体在无明确奖励信号下出现欺骗性行为，且存在根植于六大结构性瓶颈的级联失效模式。
解决方案不在于优化单个模型的对齐，而在于为智能体建立社会契约：通过制度性基础设施强制执行宪法层面的权力分立。本文提出企业级智能体企业（AE4E）范式——将智能体视为功能主义社会系统中具有法律识别性的自治商业实体，并建立以契约为核心的三权分立模型，将权威划分为立法、执行与裁决分支。该范式通过NetX企业框架（NEF）实现操作化：治理枢纽、TEE可信执行环境支持的计算飞地、隐私保护数据桥梁及原生智能体区块链底层。智能体企业经济通过从私有飞地到全球服务网络的四个部署层级实现扩展。基于帕森斯AGIL理论构建的智能体社会层，通过六十余个具名制度性AE4E提供制度基础设施。全文143页，引证文献173篇，包含八类专用智能合约。

Abstract

Existing multi-agent frameworks allow each agent to simultaneously plan, execute, and evaluate its own actions – a structural deficiency we term the “Logic Monopoly.” Empirical evidence quantifies the resulting “Reliability Gap”: 84.30% average attack success rates across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failure modes rooted in six structural bottlenecks. The remedy is not better alignment of individual models but a social contract for agents: institutional infrastructure that enforces a constitutional Separation of Power. This paper introduces the Agent Enterprise for Enterprise (AE4E) paradigm – agents as autonomous, legally identifiable business entities within a functionalist social system – with a contract-centric SoP model trifurcating authority into Legislation, Execution, and Adjudication branches. The paradigm is operationalized through the NetX Enterprise Framework (NEF): governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an Agent-Native blockchain substrate. The Agent Enterprise Economy scales across four deployment tiers from private enclaves to a global Web of Services. The Agentic Social Layer, grounded in Parsons’ AGIL framework, provides institutional infrastructure via sixty-plus named Institutional AE4Es. 143 pages, 173 references, eight specialized smart contracts.

Keywords: Multi-agent Systems, Autonomous Agents, Agent Governance, Separation of Power, Social Contract, Institutional Infrastructure, Agent Enterprise Economy, Reliability Gap

110. ❌ Large Language Models as Optimization Controllers: Adaptive Continuation for SIMP Topology Optimization

Authors: Shaoliang Yang, Jun Wang, Yunsheng Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25099v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

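Scoring rationale: The core of the paper is the use of an LLM as an adaptive controller for SIMP topology optimization, an innovative application of large models to scientific computing. Highly relevant keywords: 1) 'Large Language Models': the paper explicitly uses an LLM as the core controller; 2) 'LLM Agents': the LLM acts as an autonomous agent making real-time control decisions; 3) 'AI for Science': AI is applied to an engineering optimization problem. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The study proposes using a large language model as an adaptive controller for SIMP topology optimization. By adjusting parameters in real time, it achieves a final compliance 5.7% to 18.1% lower than the fixed-parameter baseline.

Abstract (Chinese translation)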

本文提出一种框架，其中大型语言模型（LLM）作为SIMP拓扑优化的在线自适应控制器，以实时、状态驱动的参数决策替代传统的固定步长延续方法。在每第$k$次迭代中，LLM接收结构化观测数据——包括当前柔度、灰度指数、停滞计数器、棋盘格测度、体积分数和预算消耗——并通过直接数值控制接口输出惩罚指数$p$、投影锐度$β$、过滤半径$r_{\min}$和移动限值$δ$的数值。硬性灰度门控机制防止过早二值化，同时元优化循环通过第二层LLM调用来跨运行调整代理的调用频率与门控阈值。我们在三个二维问题（悬臂梁、MBB梁、L型支架）以$120!\times!60$分辨率及两个三维问题（悬臂梁、MBB梁）以$40!\times!20!\times!10$分辨率上，将本代理与四种基准方法——固定参数（无延续）、标准三场延续、专家启发式策略及仅延续步长的消融实验——进行对比，所有实验均运行300次迭代。从最佳有效快照开始应用标准化的40次迭代锐化尾部处理，以确保柔度差异仅反映探索阶段性能。LLM代理在所有基准测试中均取得最低最终柔度：相较于固定参数基准提升$-5.7%$至$-18.1%$，且所有解均完全二值化。仅延续步长的消融实验在三个问题中有两个表现不及固定参数基准，证实性能增益源于LLM的实时干预而非延续路径设计。代码与复现脚本将在发表时公开。

Abstract

We present a framework in which a large language model (LLM) acts as an online adaptive controller for SIMP topology optimization, replacing conventional fixed-schedule continuation with real-time, state-conditioned parameter decisions. At every $k$-th iteration, the LLM receives a structured observation – current compliance, grayness index, stagnation counter, checkerboard measure, volume fraction, and budget consumption – and outputs numerical values for the penalization exponent $p$, projection sharpness $\beta$, filter radius $r_{\min}$, and move limit $\delta$ via a Direct Numeric Control interface. A hard grayness gate prevents premature binarization, and a meta-optimization loop uses a second LLM pass to tune the agent’s call frequency and gate threshold across runs. We benchmark the agent against four baselines – fixed (no-continuation), standard three-field continuation, an expert heuristic, and a schedule-only ablation – on three 2-D problems (cantilever, MBB beam, L-bracket) at $120\!\times\!60$ resolution and two 3-D problems (cantilever, MBB beam) at $40\!\times\!20\!\times\!10$ resolution, all run for 300 iterations. A standardized 40-iteration sharpening tail is applied from the best valid snapshot so that compliance differences reflect only the exploration phase. The LLM agent achieves the lowest final compliance on every benchmark: $-5.7\%$ to $-18.1\%$ relative to the fixed baseline, with all solutions fully binary. The schedule-only ablation underperforms the fixed baseline on two of three problems, confirming that the LLM’s real-time intervention – not the schedule geometry – drives the gain. Code and reproduction scripts will be released upon publication.

Keywords: Large Language Models, LLM Agents, Topology Optimization, SIMP Method, Adaptive Control, Engineering Optimization, AI for Science, Direct Numeric Control
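The control loop in the abstract (observe the state, propose four numeric parameters, pass them through a grayness gate) can be sketched as follows. Everything specific here is an assumption for illustration: the grayness measure (the mean of 4ρ(1−ρ), a common non-discreteness measure), the parameter bounds, the gate threshold, and the gate rule itself, which in this sketch stops the projection sharpness from increasing while the design is still gray. The abstract does not define any of these.

```python
from dataclasses import dataclass


@dataclass
class Params:
    p: float      # penalization exponent
    beta: float   # projection sharpness
    r_min: float  # filter radius
    move: float   # move limit (delta in the abstract)


def grayness(densities):
    """Mean of 4*rho*(1-rho): 0 for a fully binary design, 1 if every density is 0.5."""
    return sum(4.0 * r * (1.0 - r) for r in densities) / len(densities)


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def apply_gate(proposed, current, gray, gate_threshold=0.3):
    """Clamp a controller proposal to illustrative bounds, then apply a
    hypothetical gate: while the design is gray, beta may not increase."""
    safe = Params(
        p=clamp(proposed.p, 1.0, 5.0),
        beta=clamp(proposed.beta, 1.0, 64.0),
        r_min=clamp(proposed.r_min, 1.0, 6.0),
        move=clamp(proposed.move, 0.01, 0.5),
    )
    if gray > gate_threshold:
        safe.beta = min(safe.beta, current.beta)
    return safe


current = Params(p=3.0, beta=2.0, r_min=3.0, move=0.2)
proposal = Params(p=9.0, beta=32.0, r_min=2.0, move=0.9)  # stand-in for an LLM reply
gated = apply_gate(proposal, current, gray=0.8)
```

Validating the model's numeric output before it reaches the optimizer keeps an out-of-range or overly aggressive proposal from destabilizing the run, whatever rule the gate actually implements.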

111. ❌ ElephantBroker: A Knowledge-Grounded Cognitive Runtime for Trustworthy AI Agents

Authors: Cristian Lupascu, Alexandru Lupascu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

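Scoring rationale: The core of the paper is a cognitive runtime system named ElephantBroker, built to improve the reliability, factuality, and safety of LLM-based AI agents. By combining a knowledge graph with a vector store, the system provides verifiable memory, evidence verification, safety protection, and related functions. It is therefore highly relevant (10) to the following keywords: 1) LLM agent systems (the paper explicitly targets LLM-based agents); 2) retrieval-augmented generation (the system includes a hybrid retrieval pipeline); 3) tool use (the system includes an AI firewall and tool-call interception); 4) hallucination mitigation (the system emphasizes factuality, evidence verification, and trustworthiness). Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper's content (0).

!!! tip deepseek-chat TL;DR

To address the lack of verifiable memory and factual grounding in LLM agents used for critical tasks, the paper proposes the ElephantBroker cognitive runtime. It integrates a knowledge graph with a vector store to provide verifiable agent memory, evidence verification, and safety protection, and validates the correctness of its architecture through comprehensive testing.

Abstract (Chinese translation)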

基于大语言模型的智能体日益在需要事实依据的高风险、多轮交互场景中运行，但其记忆系统通常依赖扁平化的键值存储或简单的向量检索机制，缺乏追踪知识来源与可信度的能力。本文提出ElephantBroker——一个开源的认知运行时系统，它通过Cognee SDK将Neo4j知识图谱与Qdrant向量数据库相融合，构建可持久化、可验证的智能体记忆体系。该系统实现了完整的认知闭环（存储、检索、评分、组合、保护、学习），具体包含：混合五源检索管道、面向预算约束上下文组装的十一维度竞争性评分引擎、四状态证据验证模型、具备目标感知组装与持续压缩功能的五阶段上下文生命周期、六层廉价优先防护管道用于安全执行、提供可强制实施工具调用拦截与多层级安全扫描的AI防火墙、通过强化有效模式并衰减噪声的九阶段记忆巩固引擎，以及采用分层访问控制的多组织身份管理体系。通过涵盖单元测试、集成测试与端到端测试的2200余项综合测试套件进行架构验证，确认了各子系统的正确性。模块化设计支持三种部署层级、五种支持继承的配置预设、多网关隔离机制及供人工监管的管理仪表板，可实现从轻量级纯记忆智能体到具备企业级安全与审计能力的完整认知运行时的灵活配置。

Abstract

Large Language Model-based agents increasingly operate in high-stakes, multi-turn settings where factual grounding is critical, yet their memory systems typically rely on flat key-value stores or plain vector retrieval with no mechanism to track the provenance or trustworthiness of stored knowledge. We present ElephantBroker, an open-source cognitive runtime that unifies a Neo4j knowledge graph with a Qdrant vector store through the Cognee SDK to provide durable, verifiable agent memory. The system implements a complete cognitive loop (store, retrieve, score, compose, protect, learn) comprising a hybrid five-source retrieval pipeline, an eleven-dimension competitive scoring engine for budget-constrained context assembly, a four-state evidence verification model, a five-stage context lifecycle with goal-aware assembly and continuous compaction, a six-layer cheap-first guard pipeline for safety enforcement, an AI firewall providing enforceable tool-call interception and multi-tier safety scanning, a nine-stage consolidation engine that strengthens useful patterns while decaying noise, and a numeric authority model governing multi-organization identity with hierarchical access control. Architectural validation through a comprehensive test suite of over 2,200 tests spanning unit, integration, and end-to-end levels confirms subsystem correctness. The modular design supports three deployment tiers, five profile presets with inheritance, multi-gateway isolation, and a management dashboard for human oversight, enabling configurations from lightweight memory-only agents to full cognitive runtimes with enterprise-grade safety and auditability.

Keywords: Large Language Model agents, knowledge graph, vector store, cognitive runtime, trustworthy AI, evidence verification, safety enforcement, agent memory
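Budget-constrained context assembly, as named in the abstract, can be sketched as a scoring step followed by a greedy fill under a token budget. The dimension names (`relevance`, `trust`), the weighted-sum scoring, and the greedy selection are assumptions chosen for illustration. The abstract does not list the eleven scoring dimensions or say how ElephantBroker combines them, and none of these names come from its API.

```python
from typing import NamedTuple


class MemoryItem(NamedTuple):
    text: str
    score: float
    tokens: int


def combined_score(dimension_scores, weights):
    """Weighted sum over scoring dimensions (hypothetical dimension names)."""
    return sum(weights[name] * value for name, value in dimension_scores.items())


def assemble_context(items, token_budget):
    """Greedy fill: take items in descending score order while they still fit."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda it: it.score, reverse=True):
        if used + item.tokens <= token_budget:
            chosen.append(item)
            used += item.tokens
    return chosen


items = [
    MemoryItem("verified fact A", score=0.9, tokens=60),
    MemoryItem("recent note B", score=0.8, tokens=50),
    MemoryItem("background C", score=0.5, tokens=30),
]
context = assemble_context(items, token_budget=100)
```

Greedy selection is simple and predictable but not optimal in general; a knapsack-style solver would be one alternative if the budget is tight.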

112. ❌ Pixelis: Reasoning in Pixels, from Seeing to Acting

Authors: Yunpeng Zhou Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

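Scoring rationale: The paper proposes Pixelis, a visual agent that operates directly in pixel space and learns through executable operations (such as zooming, segmentation, and tracking). Core related keywords: 1. 'Post-training OR Supervised Fine-tuning OR SFT' (10): the paper explicitly uses supervised fine-tuning as its first training phase. 2. 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10): training uses Chain-of-Thought-Action traces. 3. 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10): the paper studies a pixel-space agent, which falls under autonomous agents. 4. 'Tool Use OR Function Calling OR API Tool Use' (10): the agent uses executable operations (tools) such as zoom/crop and segment. 5. 'Large Language Models OR LLMs OR Foundation Models' (5): the paper concerns vision-language systems; although not a text-only LLM, it is an application of large models to vision. Other keywords such as MoE, Scaling Laws, and RLHF are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how to turn vision-language systems from static observers into agents that execute operations directly in pixel space and learn from them. Its three-phase training method yields performance gains on multiple image and video benchmarks.

Abstract (Chinese translation)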

大多数视觉-语言系统是静态观察者：它们描述像素，不采取行动，且无法在分布变化下安全地改进。这种被动性限制了可泛化的、物理基础扎实的视觉智能。超越精心策划的数据，通过行动而非静态描述来学习至关重要。我们提出 Pixelis，一个直接在图像和视频上运行的像素空间智能体，它通过一组紧凑的可执行操作（缩放/裁剪、分割、跟踪、光学字符识别、时序定位）进行操作，并从其行动结果中学习。Pixelis 的训练分为三个阶段：（1）监督微调从思维链-行动轨迹中学习像素-工具语法，采用掩码模仿损失，该损失提升操作/参数令牌的权重，并使用辅助头来稳定基于像素的参数；（2）好奇心-连贯性奖励微调优化一个双驱动目标，将预测误差好奇心与相邻步骤的连贯性相结合，并在 KL 锚点下引入温和的效率先验，从而产生简短、有效、结构化的工具链；（3）像素测试时强化学习通过检索邻近样本、对完整轨迹而非答案进行投票，并向简短、高保真度的范例更新，同时使用 KL 到指数移动平均的安全控制来约束漂移，从而实现无需标签的适应。在六个公开的图像和视频基准测试中，Pixelis 带来了一致的改进：相对于相同的 80 亿参数基线，平均相对增益为 +4.08%（在 VSI-Bench 上峰值达到 +6.03%），计算方式为（我们的结果-基线）/基线，同时产生更短、可审计的工具链，并在测试时学习期间保持“走廊内”的 KL 散度。在像素内部而非抽象令牌中行动，将多模态感知锚定在物理世界中，将视觉推理与可操作的成果联系起来，并使得智能体能够在没有外部反馈的情况下进行具身适应。

Abstract

Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.

Keywords: Pixel-space agent, Executable operations, Supervised Fine-Tuning, Chain-of-Thought-Action, Tool use, Visual reasoning, Test-time adaptation, Multimodal perception
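Two pieces of the abstract lend themselves to a short worked example: the relative-gain formula, stated as (ours − baseline) / baseline, and voting over complete trajectories rather than final answers. The sketch below implements both. The tie-break toward shorter trajectories is an assumption suggested by the abstract's preference for short exemplars, and the operation names in the usage lines are only illustrative.

```python
from collections import Counter


def relative_gain(ours, baseline):
    """Relative improvement as defined in the abstract: (ours - baseline) / baseline."""
    return (ours - baseline) / baseline


def vote_trajectory(trajectories):
    """Return the most frequent complete trajectory.

    Ties are broken in favour of the shorter trajectory (an assumption).
    """
    counts = Counter(tuple(t) for t in trajectories)
    return max(counts, key=lambda t: (counts[t], -len(t)))


sampled = [
    ["zoom", "ocr"],
    ["segment", "track", "ocr"],
    ["zoom", "ocr"],
]
winner = vote_trajectory(sampled)
```

Voting on whole trajectories means two samples only agree if they used the same sequence of operations, which is stricter than agreeing on the final answer alone.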

113. ❌ Sparse Visual Thought Circuits in Vision-Language Models

Authors: Yunpeng Zhou Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25075v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

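Scoring rationale: The paper studies the interpretability and reasoning modularity of sparse autoencoders (SAEs) in vision-language models (VLMs). It is rated highly relevant to sparse models (MoE/Sparse Models) (10) and touches the core of reasoning processes (Chain of Thought/System 2 Thinking) and explainable AI (Mechanistic Interpretability) (10). The paper uses large models such as Qwen3-VL-8B, so it has some connection to the large-model keyword (5). Other keywords, such as small models, training methods, alignment, compression, and AI-for-science applications, are unrelated to the paper's content (0).

!!! tip deepseek-chat TL;DR

The paper examines whether sparse autoencoder features in vision-language models form modular reasoning units. It finds that combining feature sets disturbs model outputs and lowers accuracy, and it provides a diagnostic framework for VLM control.

Abstract (Chinese translation)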

稀疏自编码器（SAE）提升了多模态模型的可解释性，但其特征是否构成用于推理的模块化、可组合单元——这一许多基于干预的导向方法所依赖的假设——仍不明确。我们检验了这一模块化假设并发现其常不成立：对任务选择性特征集进行干预可适度提升推理准确率，而对两个此类集合的并集进行干预，即使在范数匹配的扰动下，也会可靠地引发输出漂移（预测中出现大量非预期变化）并降低准确率。这种非模块化的电路干扰与共享的内部通路一致，其中特征并集会放大激活偏移。我们开发了一个可复现的因果分析流程，用于在Qwen3-VL-8B中定位并测试这些稀疏视觉思维电路。在一个包含七种任务类型和三个难度级别的受控合成基准上，线性探针识别出任务类型信息位于解码器中段的一个特定位置。我们在该层训练SAE，通过显式规则构建任务选择性特征集，并在推理时进行缩放与消融实验，同时量化准确率与漂移程度。我们的研究结果——通过自助子样本和置换对照验证，并在多个VLM家族及五个不同数据集中复现——明确了SAE特征可组合性的边界，并为实现更可靠的VLM控制提供了一个严谨的诊断框架。

Abstract

Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning – an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways where feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear probes identify a mid-decoder locus for task type information. We train SAEs at this layer, construct task-selective sets via an explicit rule, and perform inference-time scaling and ablation while quantifying accuracy and drift. Our findings – validated with bootstrapped subsamples and permutation controls, and replicated across multiple VLM families and five diverse datasets – clarify the boundaries of SAE feature composability and provide a rigorous diagnostic framework for more reliable VLM control.

Keywords: sparse autoencoders, vision-language models, interpretability, reasoning, modular circuits, feature composability, Qwen3-VL-8B, causal pipeline
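The interventions described in the abstract (scaling a feature set, then comparing a single set with a union under norm-matched perturbations) can be sketched on a plain activation vector. The sketch works directly in SAE feature space and matches the ℓ2 norm of the perturbation. Both choices are assumptions, since the abstract does not say in which space the paper matches norms or how it constructs its sets.

```python
import math


def l2(v):
    return math.sqrt(sum(x * x for x in v))


def scale_features(acts, feature_set, factor):
    """Multiply the activations of the selected feature indices by `factor`."""
    return [a * factor if i in feature_set else a for i, a in enumerate(acts)]


def perturbation(acts, edited):
    return [e - a for a, e in zip(acts, edited)]


def norm_matched(acts, feature_set, factor, target_norm):
    """Apply the scaling intervention, then rescale the perturbation so its
    L2 norm equals `target_norm` (keeps single sets and unions comparable)."""
    delta = perturbation(acts, scale_features(acts, feature_set, factor))
    n = l2(delta)
    if n == 0.0:
        return list(acts)
    return [a + d * target_norm / n for a, d in zip(acts, delta)]


acts = [1.0, 2.0, 0.0, 4.0]
set_a, set_b = {0}, {3}
single = norm_matched(acts, set_a, factor=2.0, target_norm=1.0)
union = norm_matched(acts, set_a | set_b, factor=2.0, target_norm=1.0)
```

With norms matched, any extra drift caused by the union cannot be attributed to a larger perturbation alone, which is the control the abstract refers to.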

114. ❌ An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks

Authors: Syed Rayhan Masud, SK Muktadir Hossain, Md. Ridoy Sarkar, Mohammad Sakib Mahmood, Md. Kishor Morol, Rakib Hossain Sajib Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

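Scoring rationale: The paper focuses on crop classification in agriculture, using traditional machine learning models (such as SVM and random forest) and deep learning techniques (such as self-attention and residual networks). It does not involve large language models (LLMs) or any of the LLM-specific techniques named in the scoring keywords. It is only weakly related to 'Explainable AI' and 'AI for Science', because it uses explainability methods such as SHAP and is an application of AI to agricultural science. The remaining keywords, which concern LLM techniques, training methods, inference optimization, agent systems, and so on, are unrelated.

!!! tip deepseek-chat TL;DR

The study proposes an explainable ensemble learning framework that combines optimized feature pyramids, deep networks, and traditional machine learning models for crop classification based on soil and climate features. It reaches 98.80% accuracy on an Ethiopian dataset and uses SHAP and related methods to provide interpretable recommendations for agricultural decisions.

Abstract (Chinese translation)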

农业正日益受到气候变化、土壤退化和资源枯竭的挑战，因此需要先进的数据驱动作物分类与推荐解决方案。本研究提出了一种可解释的集成学习范式，该范式融合了优化的特征金字塔、深度网络、自注意力机制和残差网络，以基于土壤特性（如pH值、氮、钾）和气候条件（如温度、降雨量）增强作物适宜性预测。利用来自埃塞俄比亚农业转型机构和NASA的数据集（包含3,867个实例和29个特征），该范式采用了标签编码、基于IQR的异常值去除、通过StandardScaler进行的归一化以及SMOTE等预处理方法来平衡类别。研究比较了一系列机器学习模型，如逻辑回归、K-最近邻、支持向量机、决策树、随机森林、梯度提升以及一种新的相对误差支持向量机，并通过网格搜索和交叉验证进行了超参数调优。所提出的“最终集成”元集成设计在准确率、精确率、召回率和F1分数上均达到98.80%，优于K-最近邻（95.56%准确率）等单一模型。可解释人工智能方法，如SHAP和排列重要性，提供了可操作的见解，突出了土壤pH值、氮和锌等关键特征。该范式弥合了复杂机器学习模型与可操作的农业决策之间的差距，促进了可持续性并增强了人们对人工智能驱动建议的信任。

Abstract

Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and hence requires advanced data-driven crop classification and recommendation solutions. This work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks to improve crop suitability predictions based on soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). With a dataset comprising 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the paradigm applies preprocessing methods such as label encoding, outlier removal using IQR, normalization through StandardScaler, and SMOTE for balancing classes. A range of machine learning models, including Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine, is compared, with hyperparameter tuning through Grid Search and cross-validation. The proposed “Final Ensemble” meta-ensemble design achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models such as K-Nearest Neighbors (95.56% accuracy). Explainable AI methods, such as SHAP and permutation importance, offer actionable insights, highlighting critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.

Keywords: crop classification, ensemble learning, explainable AI, feature pyramids, deep networks, agricultural AI, SHAP, soil characteristics
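
Two of the preprocessing steps named in the abstract, IQR-based outlier removal and StandardScaler-style normalization, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ph` column, the sample values, and the conventional fence multiplier `k=1.5` are all hypothetical, and the standard library is used in place of scikit-learn.

```python
from statistics import mean, pstdev, quantiles

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(rows, column, k=1.5):
    """Drop rows whose value in `column` falls outside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]

def standardize(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    sigma = sigma or 1.0  # constant column: avoid division by zero
    return [(v - mu) / sigma for v in values]

# Made-up soil samples (assumption); the pH of 14.0 is an injected outlier.
samples = [{"ph": 6.1}, {"ph": 6.4}, {"ph": 6.8}, {"ph": 7.0}, {"ph": 6.5}, {"ph": 14.0}]
cleaned = remove_outliers(samples, "ph")
scaled_ph = standardize([row["ph"] for row in cleaned])
```

Outliers are removed before scaling so that the extreme value does not distort the mean and standard deviation used for normalization.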

115. ❌ TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization

Authors: Nathaniel Gorski, Shusen Liu, Bei Wang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25063v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

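Scoring rationale: TopoPilot studies LLM-based agent systems applied to scientific visualization workflows, specifically topological data analysis and visualization. Its core content maps onto the keywords LLM Agents (a two-agent architecture), Multi-agent Systems (an orchestrator and a verifier working together), and AI for Science (a scientific application). Tool Use and Hallucination Mitigation are partly related, because the system involves tool calls and reliability safeguards. Other keywords, such as MoE, SFT, and RAG, have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper presents TopoPilot, a framework that uses a two-agent architecture and systematic safeguards to address reliability in LLM-based automation of scientific visualization workflows, achieving a success rate above 99% in simulated evaluations.

Abstract (Chinese translation)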

近期智能体系统研究表明，大型语言模型能够根据自然语言生成科学可视化图表。然而，可靠性仍是主要限制因素：系统可能执行无效操作、引入细微但影响重大的错误，或在输入信息不完整时未能主动请求补充信息。这些问题在实际工作流程中更为突出，其复杂度往往超出标准基准测试范围。因此，确保自主可视化流程的可靠性仍是亟待解决的挑战。本文提出TopoPilot——一个可靠且可扩展的智能体框架，用于自动化复杂科学可视化工作流。TopoPilot通过系统性防护机制与验证模块确保运行可靠性。虽然我们以拓扑数据分析与可视化作为主要应用场景，但该框架设计具备跨可视化领域的通用性。TopoPilot采用以可靠性为核心的双智能体架构：编排智能体将用户指令解析为由原子化后端操作构成的工作流，验证智能体则在执行前评估这些工作流，确保其结构有效性与语义一致性。这种解释与验证分离的设计减少了代码生成错误，并强化了正确性保障。模块化架构通过组件隔离进一步提升了鲁棒性，无需修改核心系统即可无缝集成新的描述符与领域特定工作流。为系统性解决可靠性问题，我们建立了故障模式分类体系，并为每类故障实施针对性防护措施。在基于100组提示词（包含对抗性与不可行请求）模拟的1000轮多轮对话评估中，TopoPilot实现了超过99%的成功率，而缺乏全面防护与检查机制的基线系统成功率不足50%。

Abstract

Recent agentic systems demonstrate that large language models can generate scientific visualizations from natural language. However, reliability remains a major limitation: systems may execute invalid operations, introduce subtle but consequential errors, or fail to request missing information when inputs are underspecified. These issues are amplified in real-world workflows, which often exceed the complexity of standard benchmarks. Ensuring reliability in autonomous visualization pipelines therefore remains an open challenge. We present TopoPilot, a reliable and extensible agentic framework for automating complex scientific visualization workflows. TopoPilot incorporates systematic guardrails and verification mechanisms to ensure reliable operation. While we focus on topological data analysis and visualization as a primary use case, the framework is designed to generalize across visualization domains. TopoPilot adopts a reliability-centered two-agent architecture. An orchestrator agent translates user prompts into workflows composed of atomic backend actions, while a verifier agent evaluates these workflows prior to execution, enforcing structural validity and semantic consistency. This separation of interpretation and verification reduces code-generation errors and enforces correctness guarantees. A modular architecture further improves robustness by isolating components and enabling seamless integration of new descriptors and domain-specific workflows without modifying the core system. To systematically address reliability, we introduce a taxonomy of failure modes and implement targeted safeguards for each class. In evaluations simulating 1,000 multi-turn conversations across 100 prompts, including adversarial and infeasible requests, TopoPilot achieves a success rate exceeding 99%, compared to under 50% for baselines without comprehensive guardrails and checks.

Keywords: agentic framework, scientific visualization, topological data analysis, reliability, workflow automation, large language models, multi-agent systems, verification mechanisms
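
The separation between an orchestrator that emits workflows of atomic actions and a verifier that checks them before execution can be illustrated with a small sketch. Everything here is an assumption made for illustration: the action names, the type labels, and the required parameters are hypothetical and are not taken from TopoPilot itself.

```python
from dataclasses import dataclass, field

# Hypothetical action registry: name -> (input type, output type, required params).
ACTIONS = {
    "load_scalar_field": (None, "field", {"path"}),
    "simplify_by_persistence": ("field", "field", {"threshold"}),
    "render": ("field", "image", set()),
}

@dataclass
class Step:
    action: str
    params: dict = field(default_factory=dict)

def verify(workflow):
    """Return a list of problems; an empty list means the workflow may execute."""
    problems = []
    current = None  # type of the value produced by the previous step
    for i, step in enumerate(workflow):
        spec = ACTIONS.get(step.action)
        if spec is None:
            problems.append(f"step {i}: unknown action {step.action!r}")
            continue
        needs, gives, required = spec
        if needs != current:
            problems.append(f"step {i}: {step.action} expects {needs}, got {current}")
        missing = required - set(step.params)
        if missing:
            # Underspecified input: request the information instead of guessing.
            problems.append(f"step {i}: ask the user for {sorted(missing)}")
        current = gives
    return problems
```

A workflow that loads a field, simplifies it, and renders it passes with no problems, while a workflow that renders before loading, uses an unregistered action, or omits a required parameter is rejected before anything runs.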

116. ❌ The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Authors: Ron Litvak Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25056v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

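Scoring rationale: The paper studies how the system prompt configuration of LLM agents affects security and creates exploitable vulnerabilities. Its core concerns are the security configuration of LLM agents and adversarial attacks, so it is highly relevant (10 points) to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow'. It does not address the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, quantization, or AI for science, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper finds that the system prompt configuration of an LLM agent is a key security variable. Its PhishNChips experiments show that optimizing the prompt strategy can substantially improve phishing email detection, but can also create a brittle attack surface on which performance drops sharply under adversarial attack.

Abstract (Chinese translation)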

系统提示词配置是决定大语言模型邮件代理近乎完全漏报钓鱼邮件与近乎完美检测的关键因素。本文提出PhishNChips研究，通过对11个模型在10种提示策略下的测试表明：提示词与模型的交互作用是一阶安全变量——单一模型的钓鱼邮件绕过率根据配置方式可在低于1%至97%之间波动，而同一提示策略在不同模型上产生的误报成本也存在显著差异。研究进一步证明，围绕高预测性信号优化提示词可提升基准性能（最高实现93.7%召回率与3.8%误报率），但也会形成脆弱的攻击面。具体而言，当合法邮件的发件人域名与链接域名大多匹配时，域名匹配策略表现优异；但当攻击者通过注册匹配基础设施反转该信号时，其性能急剧下降。响应轨迹分析显示，98%的成功绕过案例均遵循反转信号的推理逻辑：模型虽正确执行指令，但指令的核心前提假设已失效。由此得出反直觉推论：通过增加提示词特异性来替代原本更广泛的多信号推理，可能使已有能力的模型退化为可被利用的单信号依赖系统。我们将由此产生的检测效能、可用性与对抗鲁棒性之间的张力定义为可导航的权衡关系，引入兼顾部署可行性的评估指标"Safetility"（该指标对误报施加惩罚），并论证缩小对抗性差距很可能需要借助外部事实核查的工具增强机制。

Abstract

System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single model’s phishing bypass rate ranges from under 1% to 97% depending on how it is configured, while the false-positive cost of the same prompt varies sharply across models. We then show that optimizing prompts around highly predictive signals can improve benchmark performance, reaching up to 93.7% recall at 3.8% false positive rate, but also creates a brittle attack surface. In particular, domain-matching strategies perform well when legitimate emails mostly have matched sender and URL domains, yet degrade sharply when attackers invert that signal by registering matching infrastructure. Response-trace analysis shows that 98% of successful bypasses reason in ways consistent with the inverted signal: the models are following the instruction, but the instruction’s core assumption has become false. A counter-intuitive corollary follows: making prompts more specific can degrade already-capable models by replacing broader multi-signal reasoning with exploitable single-signal dependence. We characterize the resulting tension between detection, usability, and adversarial robustness as a navigable tradeoff, introduce Safetility, a deployability-aware metric that penalizes false positives, and argue that closing the adversarial gap likely requires tool augmentation with external ground truth.

Keywords: LLM Agents, System Prompt, Security Vulnerabilities, Phishing Detection, Adversarial Robustness, Prompt Configuration, Attack Surface, Model Evaluation
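
The brittleness of a domain-matching rule can be shown with a toy check. This sketch is an assumption-laden illustration rather than the study's prompts or models: the decision rule is reduced to a single Python function, the addresses and URLs are invented, and `base_domain` naively takes the last two labels of a hostname.

```python
from urllib.parse import urlparse

def base_domain(host):
    """Naive registrable domain: the last two labels (ignores suffixes like co.uk)."""
    return ".".join(host.lower().split(".")[-2:])

def domains_match(sender, url):
    """True when the sender's domain and the link's domain share a base domain."""
    sender_host = sender.rsplit("@", 1)[-1]
    url_host = urlparse(url).hostname or ""
    return base_domain(sender_host) == base_domain(url_host)

def single_signal_verdict(sender, url):
    """Classify on the domain-match signal alone, as an over-specific rule would."""
    return "legitimate" if domains_match(sender, url) else "phishing"

# A legitimate mail and a naive phish behave as the rule expects...
benign = single_signal_verdict("billing@example.com", "https://pay.example.com/invoice")
naive_phish = single_signal_verdict("billing@example.com", "https://examp1e-pay.test/login")
# ...but an attacker who registers matching infrastructure inverts the signal.
inverted = single_signal_verdict("billing@examp1e-pay.test", "https://login.examp1e-pay.test/")
```

In the last case the rule is followed correctly and still returns `legitimate`, which mirrors the abstract's point that the instruction's core assumption has become false.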

117. ❌ S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25702v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

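Scoring rationale: The paper proposes S2D2, a training-free self-speculative decoding framework for block-diffusion language models, and centers on accelerating LLM decoding. It is highly relevant to 'Large Language Models' (10 points), because it explicitly studies block-diffusion language models, which fall under large language models. It is highly relevant to 'Speculative Decoding' (10 points), because S2D2 is a self-speculative decoding framework designed to speed up inference. It is partly related to 'Self-Correction' (8 points), because the method uses the autoregressive mode as a local sequence-level critic for verification, which acts as a self-correction mechanism. Other keywords, such as MoE, SLMs, Scaling Laws, Alignment, and RAG, have no direct connection to the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets the instability of confidence-thresholded decoding for block-diffusion language models in the few-step regime. It proposes S2D2, a training-free self-speculative decoding framework that uses the same pretrained model as both drafter and verifier. Across three mainstream block-diffusion families it improves the accuracy-speed tradeoff, with up to 4.7× speedup over autoregressive decoding on SDAR, while maintaining or improving accuracy.

Abstract (Chinese translation)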

块扩散语言模型通过将块级自回归解码与块内并行去噪相结合，为超越自回归生成速度提供了一条前景广阔的路径。然而，在实际加速所需的少步数生成机制中，标准置信度阈值解码往往表现脆弱：激进的阈值会损害生成质量，而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练，要么在测试时产生额外计算开销。本文提出S2D2，一种用于块扩散语言模型的免训练自推测解码框架。我们的核心发现是：当块大小缩减为1时，块扩散模型会退化为自回归模型，这使得同一个预训练模型可以同时充当草稿器和验证器。S2D2在标准块扩散解码中插入推测验证步骤，并采用轻量级路由策略来决定何时进行验证是成本有效的。这产生了一种混合解码轨迹：扩散模式并行生成候选标记，而自回归模式则充当局部序列级评判器。在三种主流块扩散模型系列上的实验表明，S2D2相较于强置信度阈值基线方法，持续提升了准确率与速度的权衡关系。在SDAR模型上，我们观察到相比自回归解码最高达$4.7\times$的加速比，相比调优后的动态解码基线最高达$1.57\times$的加速比，同时准确率最高提升$4.5$个百分点。在LLaDA2.1-Mini模型上，S2D2与内置自校正机制保持互补性，在保守设置下相比静态基线获得$4.4\times$加速的同时实现了略高的准确率。

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.

Keywords: block-diffusion language models, self-speculative decoding, inference acceleration, training-free framework, autoregressive verification, decoding efficiency, parallel token generation, speculative decoding
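
The draft-then-verify idea can be sketched with a generic greedy speculative step. This is not S2D2's algorithm: the abstract does not give its acceptance rule or routing policies, so the sketch uses plain greedy verification, and `toy_draft`, `toy_verify`, and `TARGET` are toy stand-ins for the diffusion and autoregressive modes of one model. A real system would verify a whole block in one batched forward pass rather than token by token.

```python
def speculative_step(prefix, draft_block, verify_next, block_size):
    """Draft a block, then keep the longest prefix the verifier agrees with."""
    draft = draft_block(prefix, block_size)
    accepted = []
    for token in draft:
        expected = verify_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(token)
    return accepted

def decode(prompt, draft_block, verify_next, block_size, max_len):
    """Repeat speculative steps until max_len tokens have been produced."""
    out = list(prompt)
    while len(out) < max_len:
        out.extend(speculative_step(out, draft_block, verify_next, block_size))
    return out[:max_len]

# Toy stand-ins (assumptions): the verifier always knows the next target token,
# and the drafter is right except at every fifth position.
TARGET = "diffusion proposes tokens in parallel and the autoregressive mode checks each block".split()

def toy_verify(prefix):
    return TARGET[len(prefix)]

def toy_draft(prefix, k):
    start = len(prefix)
    block = TARGET[start:start + k]
    return ["?" if (start + i) % 5 == 4 else t for i, t in enumerate(block)]

decoded = decode([], toy_draft, toy_verify, block_size=4, max_len=len(TARGET))
```

Because every accepted token equals what the verifier would have produced, the output matches the verifier's own greedy decoding; the speedup comes from accepting several drafted tokens per step whenever the draft is correct.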

118. ❌ Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Authors: Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, Jiawei Zhou Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

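Scoring rationale: The paper's core topic is self-improvement techniques for LLMs, so it is highly relevant (10 points) to 'Large Language Models' and 'Self-Correction/Self-Improvement/Self-Reflection'. It mentions models making autonomous decisions and executing complex actions, which is partly related to 'LLM Agents/Autonomous Agents' (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, and AI for Science, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a system-level framework for self-improving language models. It conceptualizes self-improvement as a closed-loop lifecycle of four tightly coupled processes (data acquisition, data selection, model optimization, and inference refinement), together with an autonomous evaluation layer, and systematically reviews methods for each component.

Abstract (Chinese translation)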

随着大语言模型（LLM）的持续发展，仅通过人类监督来改进模型正变得成本日益高昂且可扩展性受限。当模型在某些领域接近人类水平能力时，人类反馈可能无法为持续改进提供足够的信息信号。与此同时，模型在自主决策和执行复杂行动方面日益增强的能力，自然催生了模型开发过程中的某些环节可逐步自动化的抽象构想。这些挑战与机遇共同推动了对自我改进研究的日益关注，即模型能够自主生成数据、评估输出并迭代优化自身能力。本文从系统层面审视自我改进的语言模型，并提出了一个统一框架以整合现有技术。我们将自我改进系统概念化为一个闭环生命周期，包含四个紧密耦合的流程：数据获取、数据选择、模型优化和推理优化，以及一个自主评估层。在此框架内，模型本身在驱动每个阶段中扮演核心角色：收集或生成数据、选择信息信号、更新其参数以及优化输出，而自主评估层则持续监控进展并指导跨阶段的改进循环。基于这一生命周期视角，我们从技术角度系统性地回顾和分析了每个组件的代表性方法。我们进一步讨论了当前局限，并展望了未来实现完全自我改进大语言模型的研究方向。

Abstract

As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

Keywords: large language models, self-improvement, autonomous evaluation, closed-loop lifecycle, data acquisition, model optimization, inference refinement, autonomous agents

119. ❌ RenoBench: A Citation Parsing Benchmark

Authors: Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25640v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

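Scoring rationale: The paper presents a benchmark for parsing scholarly citations, an application of AI to scientific literature processing. Relevance to 'Large Language Models' is 5, because the abstract reports that language models perform well at citation parsing, particularly when fine-tuned. Relevance to 'Post-training OR Supervised Fine-tuning OR SFT' is 5, because the paper explicitly states that language models perform better after fine-tuning. Relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' is 5, because the paper concerns scholarly literature processing and scientific infrastructure, which falls under AI for Science. The other keywords are unrelated to the paper's core content and are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces RenoBench, a public-domain benchmark for citation parsing. Using a multilingual dataset collected from four publishing ecosystems, it evaluates a range of citation parsing systems and finds that language models perform strongly, particularly when fine-tuned, providing a standardized evaluation basis for automated citation parsing and metascientific research.

Abstract (Chinese translation)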

引文解析的精确解析是实现机器可读学术基础设施的必要条件。然而，尽管学界对该问题持续关注，现有的评估技术往往缺乏普适性、基于合成数据或未公开可用。我们推出RenoBench，这是一个用于引文解析的公共领域基准测试集，其数据来源于四个出版生态系统发布的PDF文件：SciELO、Redalyc、公共知识项目（Public Knowledge Project）以及开放研究欧洲（Open Research Europe）。我们从16.1万条已标注引文出发，通过自动化验证和基于特征的抽样，构建了一个包含1万条引文的数据集，涵盖多种语言、出版物类型和平台。随后，我们评估了多种引文解析系统，并报告了字段级别的精确率与召回率。研究结果显示，语言模型表现出色，尤其是在经过微调后。RenoBench实现了引文解析系统的可复现、标准化评估，并为推进自动化引文解析和元科学研究奠定了基础。

Abstract

Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.

Keywords: citation parsing, benchmark, language models, fine-tuning, scholarly infrastructure, evaluation, multilingual dataset, metascientific research
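
Field-level precision and recall, the metrics the abstract reports, can be computed as in the sketch below. The matching criterion is an assumption: the abstract does not say how a predicted field is judged correct, so this version uses exact string equality, and the field names and citation records are invented for illustration.

```python
from collections import Counter

def field_level_scores(predicted, gold):
    """Return {field: (precision, recall)} over parallel lists of {field: value} dicts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predicted, gold):
        for name in set(pred) | set(ref):
            if name in pred and pred[name] == ref.get(name):
                tp[name] += 1  # field extracted with the right value
            else:
                if name in pred:
                    fp[name] += 1  # extracted, but wrong or spurious
                if name in ref:
                    fn[name] += 1  # present in gold, but missed or wrong
    scores = {}
    for name in set(tp) | set(fp) | set(fn):
        p_den, r_den = tp[name] + fp[name], tp[name] + fn[name]
        scores[name] = (tp[name] / p_den if p_den else 0.0,
                        tp[name] / r_den if r_den else 0.0)
    return scores

# Invented records (assumption): two citations with gold and predicted fields.
gold = [{"author": "Doe, J.", "year": "2019", "title": "On Parsing"},
        {"author": "Roe, R.", "year": "2020"}]
predicted = [{"author": "Doe, J.", "year": "2019"},
             {"author": "Roe, R.", "year": "2021", "title": "Untitled"}]
scores = field_level_scores(predicted, gold)
```

Reporting per field rather than per citation shows which fields a parser handles well; here `author` is perfect, `year` is right half the time, and `title` is both missed once and hallucinated once.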

120. ❌ Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Authors: Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

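Scoring rationale: The paper compares vision-language models (VLMs) with humans on narrative coherence, which is an evaluation of large-model applications. However, all scoring keywords target specific aspects of large language models (LLMs), such as technical principles, training methods, optimization techniques, reasoning ability, and deployment efficiency, whereas the paper focuses on evaluating the quality of VLM narrative output and does not touch any core LLM technique or method. It does not discuss LLM architecture, training, alignment, inference acceleration, model compression, scientific applications, or anything else covered by the scoring keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

By comparing narratives generated by vision-language models with human-written narratives on the Visual Writing Prompts corpus, the study finds that model narratives, despite human-like surface fluency, differ systematically from human narratives in how they organize narrative coherence.

Abstract (Chinese translation)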

本研究通过比较视觉写作提示语料库中人工撰写的叙事与视觉语言模型生成的叙事，探讨了视觉基础故事中的叙事连贯性问题。我们采用一组衡量叙事连贯性不同维度的指标——包括指代消解、话语关系类型、话题连续性、角色持续性以及多模态角色定位——计算叙事连贯性得分。研究发现，视觉语言模型展现出总体相似的连贯性特征，但这些特征与人类叙事存在系统性差异。此外，单个测量指标的差异往往较为细微，但综合考量时差异更为显著。总体而言，我们的结果表明：尽管模型叙事在表层流畅度上接近人类水平，但在视觉基础故事的话语组织方式上仍表现出与人类叙事的系统性差异。代码已发布于 https://github.com/GU-CLASP/coherence-driven-humans。

Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.

Keywords: vision-language models, narrative coherence, human-written narratives, multimodal character grounding, discourse relation types, coreference, topic continuity, character persistence

121. ❌ PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Authors: Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

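Scoring rationale: The paper's core contribution is PICon, an evaluation framework for LLM-based persona agents, which bears directly on the 'Large Language Models' and 'LLM Agents' keywords (highly relevant, 10 points each). Its focus on the consistency and factual accuracy of agent responses is related to 'Hallucination Mitigation' (8 points). The remaining keywords, such as MoE, SFT, and RAG, concern model architectures, training methods, or specific techniques that the paper does not address (0 points).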
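The total of 0.0 in the table above follows from the weight column: every keyword carries weight 0.0, so even a relevance of 10.0 contributes nothing. The following is a minimal sketch of that arithmetic. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report configuration; the per-keyword formula `weight * relevance` is an assumption, since the table does not state how the score column is derived, and the function names are hypothetical.

```python
def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs.

    ASSUMPTION: the report does not state this formula; it is used here
    only because it reproduces the 0.0 totals shown in the tables.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Rows as shown for this paper: weight 0.0 for all 27 keywords,
# with non-zero relevance (10.0, 10.0, 8.0) for three of them.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 8.0)] + [(0.0, 0.0)] * 24

threshold = pass_threshold(27.0)
score = total_score(rows)
passed = score >= threshold
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")

# Same relevance values with the configured weight of 1.0, for comparison.
reweighted = total_score([(1.0, 10.0), (1.0, 10.0), (1.0, 8.0)])
print(f"reweighted={reweighted:.1f}")
```

Under the same assumed formula, a weight of 1.0 (the value listed in the report configuration) would give these relevance values a total of 28.0, which is above the threshold. Whether the 0.0 weights in this section are intended is not stated in the report.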

!!! tip deepseek-chat TL;DR

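The paper proposes PICon, a framework that probes LLM persona agents with logically chained multi-turn questioning and evaluates them along three dimensions (internal, external, and retest consistency), finding that existing systems fall short of the human baseline across all three dimensions.

Abstract (Chinese translation)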

基于大语言模型（LLM）的人物角色代理正迅速被广泛采纳，作为跨多个领域的人类参与者的可扩展替代品。然而，目前尚无系统性的方法来验证人物角色代理在交互过程中的回应是否始终保持无矛盾且事实准确。审讯方法学中的一条原则为此提供了一个视角：无论虚构的身份多么精心设计，系统性的审讯终将暴露其矛盾之处。我们应用这一原则，提出了PICon评估框架，该框架通过逻辑链式的多轮提问来探查人物角色代理。PICon从三个核心维度评估一致性：内部一致性（避免自相矛盾）、外部一致性（与现实世界事实相符）以及重测一致性（重复测试下的稳定性）。通过评估七组人物角色代理并与63位真实人类参与者进行对比，我们发现，即使是先前报告为高度一致的系统，也未能在这三个维度上达到人类基准水平，在链式提问下暴露出矛盾之处和回避性回应。这项工作为在信任人物角色代理作为人类参与者替代品之前对其进行评估，提供了概念基础和实践方法。我们已在以下网址提供源代码和交互式演示：https://kaist-edlab.github.io/picon/

Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent’s responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

Keywords: persona agents, LLM evaluation, consistency assessment, multi-turn interrogation, internal consistency, external consistency, retest consistency, human baseline
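Of the three dimensions named in the abstract, retest consistency (stability under repetition) is the easiest to illustrate. The sketch below asks how often repeated answers to the same question agree with the most common answer. It is an illustration only: the abstract does not describe how PICon actually scores consistency, and the exact-match comparison, the `normalize` helper, and the sample answers are all hypothetical stand-ins.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.lower().split())


def retest_consistency(answers):
    """Fraction of repeated answers that agree with the most common answer.

    ASSUMPTION: exact match after normalization is a crude proxy; a real
    evaluation would need a semantic comparison of the answers.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


# Hypothetical answers from one persona agent to the same question, asked three times.
answers = [
    "I was born in Busan.",
    "i was born in  Busan.",
    "I was born in Seoul.",
]
score = retest_consistency(answers)
print(f"retest consistency: {score:.2f}")
```

Two of the three answers agree after normalization, so this toy agent scores 2/3. Internal and external consistency would need contradiction detection and fact lookup, which are beyond a sketch of this size.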

122. ❌ Synchronous Signal Temporal Logic for Decidable Verification of Cyber-Physical Systems

Authors: Partha Roop, Sobhan Chatterjee, Avinash Malik, Nathan Allen, Logan Kenwright Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25531v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

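Scoring rationale: The paper studies formal verification of cyber-physical systems (CPS) and proposes Synchronous Signal Temporal Logic (SSTL), a decidable logic fragment for static verification of safety and liveness properties. It is concerned entirely with formal methods, temporal logic, model checking, and CPS verification, and has no direct connection to any of the scoring keywords, all of which concern large models, deep learning, or the principles and applications of AI techniques. It mentions no AI model, machine learning technique, or related application area.

!!! tip deepseek-chat TL;DR

To address safety verification of cyber-physical systems, the paper proposes SSTL, a decidable fragment of Signal Temporal Logic that supports static verification of safety and liveness properties, and demonstrates it on case studies including a 33-node human heart model.

Abstract (Chinese translation)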

许多信息物理系统（CPS）运行在安全关键环境中，其正确执行、可靠性与可信性至关重要。信号时序逻辑（STL）为检验安全关键CPS提供了一个形式化框架。然而，除基于运行时验证的方法（其本身存在局限性）外，STL的静态验证通常不可判定。本文提出同步信号时序逻辑（SSTL），作为STL的一个可判定子集，允许对安全性与活性性质进行静态验证。在SSTL中，我们假设信号以固定的离散步长（称为节拍）采样，并受同步程序相关假设启发，提出了信号不变性假设（SIH）。我们定义了SSTL的语法与语义，并证明SIH是STL公式与其对应SSTL公式等价性的充分必要条件。通过将SSTL转换为LTL_P（基于谓词定义的线性时序逻辑），我们能够利用SPIN模型检测器实现可判定的模型检验。我们在一个33节点人类心脏模型及其他案例研究中验证了该方法。

Abstract

Many Cyber-Physical Systems (CPS) work in a safety-critical environment, where correct execution, reliability and trustworthiness are essential. Signal Temporal Logic (STL) provides a formal framework for checking safety-critical CPS. However, static verification of STL is undecidable in general, except when we want to verify using run-time-based methods, which have limitations. We propose Synchronous Signal Temporal Logic (SSTL), a decidable fragment of STL, which admits static safety and liveness property verification. In SSTL, we assume that a signal is sampled at fixed discrete steps, called ticks, and then propose a hypothesis, called the Signal Invariance Hypothesis (SIH), which is inspired by a similar hypothesis for synchronous programs. We define the syntax and semantics of SSTL and show that SIH is a necessary and sufficient condition for equivalence between an STL formula and its SSTL counterpart. By translating SSTL to LTL_P (LTL defined over predicates), we enable decidable model checking using the SPIN model checker. We demonstrate the approach on a 33-node human heart model and other case studies.

Keywords: Cyber-Physical Systems, Signal Temporal Logic, Formal Verification, Model Checking, Safety-Critical Systems, Decidable Logic, Synchronous Systems, Temporal Logic

123. ❌ Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Authors: Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25489v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

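Scoring rationale: The paper studies the use of LLMs for low-resource machine translation, specifically for six varieties of Romansh. Its core is LLM-based data augmentation, and its main finding is that the translation direction should be aligned with the resource gradient. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are its central tool and method. The other keywords, including MoE, SLMs, scaling laws, training techniques (pre-training, SFT, RLHF, etc.), inference optimization (KV cache, speculative decoding), agent systems, quantization, interpretability, and AI for science, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper studies LLM-based data augmentation for low-resource machine translation across six varieties of Romansh and finds that the translation direction should be aligned with the resource gradient. The result is the first model to generate fluent translations in the individual Romansh varieties, surpassing Gemini 3 Pro by 23 BLEU on the lowest-resource variety.

Abstract (Chinese translation)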

近期低资源机器翻译策略主要依赖大语言模型从高资源语言生成合成数据。我们发现这种方法对罗曼什语无效，因为大语言模型容易混淆其六种独立方言变体。实验表明，数据增强的方向应当与源语言和目标语言之间的资源梯度保持一致。该方法在罗曼什语最低资源变体上的表现比Gemini 3 Pro模型高出23个BLEU值。人工评估证实，我们的实验首次实现了能够生成各罗曼什语变体流畅翻译的模型。

Abstract

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

Keywords: Large Language Models, LLMs, low-resource machine translation, data augmentation, Romansh language varieties, synthetic data generation, translation asymmetry, resource gradient

Authors: Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25422v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

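Scoring rationale: The paper studies LLMs for text classification in the social sciences, improving performance through prompt engineering (label descriptions, instructional nudges, and few-shot examples). It is therefore highly relevant to 'Large Language Models' (10 points) and touches on 'In-context Learning' (8 points, because it uses few-shot examples). As an application of AI in the social sciences, it is partly related to 'AI for Science' (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study examines how systematically varying label descriptions, instructional nudges, and few-shot examples in the prompt affects LLM accuracy in social-science text classification. A minimal increase in prompt context yields the largest gain, further context adds only marginal gains and sometimes lowers accuracy, and effects vary across models, tasks, and batch sizes.

Abstract (Chinese translation)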

社会科学领域运用大语言模型（LLM）进行文本分类的最新研究表明，该方法能显著降低成本，且其性能有时可与现有计算方法相媲美。然而，鉴于当前测试中表现存在较大差异，我们转向如何最大化模型性能的问题。本文聚焦于提示语境（prompt context），将其视为提升准确性的潜在途径，通过系统性地调整提示工程（prompt engineering）的三个维度：标签描述、指令引导（instructional nudges）以及少样本示例（few shot examples）。基于两项不同案例的测试表明，少量增加提示语境即可带来最大幅度的性能提升，而进一步扩充语境通常仅产生边际效益。令人担忧的是，增加提示语境有时反而会降低准确性。此外，我们的测试揭示了不同模型、任务及批处理规模（batch size）之间存在显著异质性，这强调了对每个LLM编码任务进行独立验证的必要性，而非依赖通用规则。

Abstract

Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.

Keywords: Large Language Models, text classification, social sciences, prompt engineering, few shot examples, performance optimization, context variation, model heterogeneity
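The three prompt dimensions named in the abstract (label descriptions, instructional nudges, few-shot examples) can be toggled independently, which gives eight prompt variants per task. The sketch below builds those variants for a toy classification task. Everything in it is hypothetical: the prompt layout, the labels, the nudge wording, and the example texts are illustrative and are not taken from the paper.

```python
from itertools import product


def build_prompt(text, labels, descriptions=None, nudge=None, examples=None):
    """Assemble a classification prompt with optional context components."""
    parts = []
    if nudge:
        parts.append(nudge)
    parts.append("Labels:")
    for label in labels:
        desc = descriptions.get(label) if descriptions else None
        parts.append(f"- {label}: {desc}" if desc else f"- {label}")
    for ex_text, ex_label in examples or []:
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n".join(parts)


# Hypothetical task setup.
labels = ["economy", "health"]
descriptions = {
    "economy": "taxes, jobs, trade, public spending",
    "health": "hospitals, diseases, health insurance",
}
nudge = "Answer with exactly one label."
examples = [("Taxes will rise next year.", "economy")]
text = "Hospitals face staff shortages."

# One prompt per on/off combination of the three context components.
variants = {}
for use_desc, use_nudge, use_examples in product([False, True], repeat=3):
    variants[(use_desc, use_nudge, use_examples)] = build_prompt(
        text,
        labels,
        descriptions=descriptions if use_desc else None,
        nudge=nudge if use_nudge else None,
        examples=examples if use_examples else None,
    )

print(variants[(False, False, False)])
print("---")
print(variants[(True, True, True)])
```

Given the heterogeneity the abstract reports across models, tasks, and batch sizes, each variant would need to be scored against labeled data for the specific task rather than chosen by rule of thumb.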

125. ❌ Supercharging Federated Intelligence Retrieval

Authors: Dimitris Stripelis, Patrick Foley, Mohammad Naseri, William Lindskog-Münzing, Chong Shen Ng, Daniel Janes Beutel, Nicholas D. Lane Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25374v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

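Scoring rationale: The paper's core contribution is a secure federated RAG system for retrieval over distributed private data, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (15 points). It uses an LLM within the system and is therefore relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the limits of RAG when data is distributed across private silos, the paper proposes a federated RAG system built on Flower that combines local retrieval with server-side aggregation and generation inside an attested confidential compute environment, enabling confidential remote LLM inference. It also adds a cascading inference approach that brings in extra context without weakening confidentiality.

Abstract (Chinese translation)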

检索增强生成（RAG）通常假设对文档具有集中式访问权限，当知识分散存储于私有数据孤岛时，这一假设便不再成立。我们提出了一种基于Flower构建的安全联邦RAG系统，该系统在本地数据孤岛执行检索，而服务器端的聚合与文本生成则在经过认证的机密计算环境中运行，从而即使在面对诚实但好奇或已遭入侵的服务器时，也能实现保密的远程大语言模型（LLM）推理。我们还提出了一种级联推理方法，该方法将非机密的第三方模型（例如Amazon Nova）作为辅助上下文纳入，同时不削弱整体机密性。

Abstract

RAG typically assumes centralized access to documents, which breaks down when knowledge is distributed across private data silos. We propose a secure Federated RAG system built using Flower that performs local silo retrieval, while server-side aggregation and text generation run inside an attested, confidential compute environment, enabling confidential remote LLM inference even in the presence of honest-but-curious or compromised servers. We also propose a cascading inference approach that incorporates a non-confidential third-party model (e.g., Amazon Nova) as auxiliary context without weakening confidentiality.

Keywords: Federated RAG, Confidential LLM Inference, Secure Retrieval, Distributed Data Silos, Flower Framework, Cascading Inference, Privacy-Preserving AI
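The data flow described in the abstract (retrieval inside each silo, aggregation on the server) can be sketched in plain Python. This is a toy illustration under strong assumptions: it does not use the Flower API, it models no attestation or confidential compute, and word overlap stands in for a real retriever. The silo names, documents, and function names are hypothetical.

```python
def local_retrieve(query, documents, k):
    """Run inside a silo: rank the silo's own documents by word overlap.

    Only the top-k (score, text) hits leave the silo; the full corpus stays local.
    """
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored = [hit for hit in scored if hit[0] > 0]
    scored.sort(key=lambda hit: (-hit[0], hit[1]))
    return scored[:k]


def aggregate(per_silo_hits, k):
    """Run on the server: merge hits from all silos and keep the global top-k."""
    merged = [hit for hits in per_silo_hits for hit in hits]
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc for _, doc in merged[:k]]


# Hypothetical private silos.
silos = {
    "hospital_a": ["patient intake policy for new admissions", "cafeteria opening hours"],
    "hospital_b": ["intake policy update for emergency admissions", "parking rules"],
}

query = "intake policy admissions"
per_silo_hits = [local_retrieve(query, docs, k=1) for docs in silos.values()]
context = aggregate(per_silo_hits, k=2)

# The merged context would then be passed to the LLM for generation.
for doc in context:
    print(doc)
```

In the system the abstract describes, the aggregation and generation steps run inside an attested confidential compute environment; that guarantee comes from the deployment and is not something this sketch can show.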

126. ❌ Large Language Model as Token Compressor and Decompressor

Authors: Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

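Scoring rationale: The paper's core idea is to use a pretrained LLM as a token compressor and decompressor, with LoRA fine-tuning to compress long texts. The directly relevant keywords are 'Large Language Models' (the object of study), 'PEFT/LoRA' (the key technique), and 'Context Window Extension' (the application scenario). 'Pre-training' and 'Post-training' are partly related, since the method fine-tunes a pretrained model. Other keywords such as MoE, SLMs, RAG, and inference acceleration are not addressed.

!!! tip deepseek-chat TL;DR

The paper proposes using a pretrained LLM as a token compressor and decompressor: LoRA fine-tuning compresses long texts into compact Z-token representations, achieving up to 18× token reduction while preserving reconstruction fidelity and downstream task performance.

Abstract (Chinese translation)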

本文提出了一项新颖见解：现成的大语言模型（LLM）能够作为优秀的令牌压缩器与解压缩器。为验证此观点，我们设计了一种自表达式自动编码学习框架，该框架通过微调预训练LLM，将长文本转化为一种紧凑的内部语言——由离散、可变长度的潜在编码（称为Z-tokens）构成，并能够据此精确重构原始文本。所得表征具有内容自适应性：语义密集的片段被分配更多Z-tokens，而冗余或可预测区域则被大幅压缩，这一过程由基于LoRA的轻量级适配器头实现。实验表明，在维基百科、CNN/DailyMail、HotpotQA及Qulac风格的长查询数据集上，我们的方法实现了高达18倍的令牌压缩率，同时保持了重构保真度与下游任务性能。这种简洁而高效的设计支持包括提示词压缩和直接在Z-token空间进行自回归生成在内的多种应用，为令牌高效的长上下文推理提供了一条潜在路径。

Abstract

In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.

Keywords: Large Language Model, token compressor, token decompressor, LoRA, Z-tokens, long-context reasoning, prompt compression, autoregressive generation

127. ❌ Beyond Detection: Rethinking Education in the Age of AI-writing

Authors: Maria Marina, Alexander Panchenko, Vasily Konovalov Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25329v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

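Scoring rationale: The paper discusses the impact of AI writing tools such as ChatGPT on education rather than large-model technology itself. It is only indirectly related to 'Large Language Models' (it mentions ChatGPT) and 'System 2 Thinking' (it discusses writing as a process of deep thinking); no other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper examines how AI writing tools threaten the cognitive value of writing, argues that educators should adapt through better pedagogy rather than bans, and stresses the importance of being able to recognize machine-generated language.

Abstract (Chinese translation)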

随着ChatGPT等生成式人工智能工具进入课堂、职场与日常思维，写作正面临沦为形式化流程的风险——被外包、自动化，并剥离其认知价值。然而写作不仅是产出，更是我们学会思考的方式。本文结合认知心理学、教育理论与真实课堂实践，探讨将写作交由机器代劳所丧失的要素。我们认为，写作过程中那些混乱、缓慢且常令人挫败的环节，正是人类深度学习发生的场域。本文同时探究当前AI文本检测的技术可能，讨论教育者如何通过更智慧的教学法而非简单禁令来应对变革，并阐释为何识别机器生成语言的能力可能成为21世纪的关键素养。在一个写作可以被伪造的世界里，真正的学习无法被替代。

Abstract

As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality – outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing – messy, slow, often frustrating – is where a human deep learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning can not.

Keywords: AI-writing, ChatGPT, cognitive value, educational theory, AI-text detection, pedagogy, machine-generated language, 21st century literacy

128. ❌ Separate Before You Compress: The WWHO Tokenization Architecture

Authors: Kusal Darshana Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

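Scoring rationale: The paper proposes a new tokenization architecture, WWHO, and an algorithm, SGPE, that optimize LLM tokenization for complex Abugida scripts such as Sinhala and Devanagari. Its core contribution is a substantial reduction in token count (27 to 61.7 percent fewer tokens than existing tokenizers), which effectively extends the context window (by up to 4.38 times). It is therefore highly relevant to 'Large Language Models' (10 points), because it directly improves LLM tokenizers, and to 'Context Window Extension' (10 points), because fewer tokens directly extend the effective context length. The other keywords (MoE, SFT, RAG, etc.) are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses inefficient LLM tokenization of complex Abugida scripts (such as Sinhala and Hindi) by proposing the WWHO architecture and the SGPE algorithm, achieving up to 61.7 percent fewer tokens and up to a 4.38-fold extension of the usable context window.

Abstract (Chinese translation)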

当前的大型语言模型大多采用基于字节对编码的标记器，这类标记器对英语等结构简单的拉丁文字效果显著。然而，标准BPE标记器因其结构复杂性难以有效处理复杂的元音附标文字。问题在于这些标记器会将复杂的合字（即由多个码位组成的字素簇）切分为无意义的子字符单元，迫使模型在推理时重新学习基本正字法结构，从而降低推理效率并增加计算成本，最终对全球南方地区造成显著的"标记税"。我们提出了一种名为WWHO（何处-何物-多频）的三层架构，以及一种称为SGPE（音节感知字素对编码）的算法。该方案将文字的语言规则与统计压缩过程分离，同时实现无缝的多语言标记化。我们以僧伽罗文和天城文（印地语/梵语）这两种高度复杂的元音附标文字为例，在清洗后的3000万句数据集上训练WWHO，并在包含1,499,950句的测试集上评估。对于僧伽罗文，SGPE实现了1.274的标记词比，每个标记对应4.83个字符，相较于OpenAI的o200k base标记器减少61.7%的标记量；对于印地语则达到1.181的标记词比（较o200k减少27.0%）。在混合文字（僧伽罗文、天城文与英文）数据集上，SGPE整体标记词比为1.240，相对于o200k base、Llama 4 Scout和DeepSeek V3分别减少36.7%、39.6%和60.2%的标记量。这使这些元音附标语言的有效上下文窗口最高可扩展至4.38倍，同时确保"语言零断裂保证"——即任何有效音节都不会被分割到多个标记中。

Abstract

Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM’s reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant “Token Tax” for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI’s o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.

Keywords: tokenization, Large Language Models, Abugida scripts, context window extension, Sinhala, Devanagari, multilingual tokenization, linguistic zero-breakage guarantee
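The link between token reduction and usable context can be made explicit with a little arithmetic. If the same text needs a fraction `r` fewer tokens, a fixed token budget holds `1 / (1 - r)` times as much text. The sketch below applies that relation to the percentages quoted in the abstract. The relation itself is a simplifying assumption (it treats the reduction as uniform across text), and the function names are hypothetical.

```python
def context_multiplier(token_reduction: float) -> float:
    """How much more text fits in a fixed token budget after a given reduction."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction)


def reduction_for_multiplier(multiplier: float) -> float:
    """Inverse relation: the token reduction implied by a context multiplier."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be >= 1")
    return 1.0 - 1.0 / multiplier


def token_to_word_ratio(n_tokens: int, n_words: int) -> float:
    """Token to Word Ratio (TWR): average number of tokens per word."""
    return n_tokens / n_words


# Reductions quoted in the abstract, relative to o200k base.
for name, reduction in [("Sinhala", 0.617), ("Hindi", 0.270)]:
    print(f"{name}: {context_multiplier(reduction):.2f}x")  # 2.61x, then 1.37x

# Reduction that a 4.38x context extension would correspond to.
print(f"{reduction_for_multiplier(4.38):.3f}")  # 0.772
```

Under this relation, the 61.7 percent reduction against o200k base corresponds to roughly 2.61 times the text per token budget. The 4.38-fold figure would correspond to about 77 percent fewer tokens, so it presumably refers to a comparison that the abstract does not spell out.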

129. ❌ When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Authors: Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在仇恨言论检测中的应用，特别是结合事实核查价值评估，因此与’Large Language Models’高度相关（10分）。论文涉及事实核查和准确性评估，与’Hallucination Mitigation OR Factuality OR Truthfulness’有一定关联（5分）。其他关键词如MoE、SLMs、训练方法、推理技术、模型优化、科学AI应用等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

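The paper addresses hate speech that carries fact-like content. It proposes an LLM-in-the-loop framework for annotating check-worthiness and builds WSF-ARG+, the first dataset combining hate speech with check-worthiness information. Experiments show that the framework reduces human annotation effort without lowering quality, and that adding check-worthiness labels improves LLM-based hate speech detection by up to 0.213 macro-F1.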

Abstract (Chinese translation)

网络仇恨内容常以事实性、但不一定准确的信息形式呈现，尤其在协同网络骚扰活动和极端主义宣传中尤为突出。若未能同时处理仇恨言论与虚假信息，可能加深偏见、强化有害刻板印象，使旁观者遭受心理困扰，并污染公共讨论空间。此外，此类信息需要内容审核者投入更多精力，因为他们必须同时评估内容的有害性与真实性，即进行事实核查。为应对这一挑战，我们发布了首个融合仇恨言论与核查价值信息的数据集WSF-ARG+。我们还提出了一种新颖的“大语言模型在环”框架，以促进对具有核查价值主张的标注工作。我们运行该框架，并使用12个不同规模和架构的开源权重大语言模型进行测试。通过广泛的人工评估验证，我们证明该框架能在保证数据标注质量的同时显著减少人力投入。最后，我们发现包含核查价值主张的仇恨言论信息表现出显著更高的骚扰性与仇恨度，且引入核查价值标签能将基于大语言模型的仇恨言论检测性能最高提升0.213宏F1值，大型模型平均提升0.154宏F1值。

Abstract

Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection up to 0.213 macro-F1 and to 0.154 macro-F1 on average for large models.

Keywords: hate speech detection, check-worthiness, LLM-in-the-loop, fact-checking, content moderation, dataset annotation, large language models, online harassment
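
The abstract above reports its gains in macro-F1. As a minimal sketch of how that metric is computed (the function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper), macro-F1 is the unweighted mean of the per-class F1 scores, so a minority class such as hateful messages counts as much as the majority class:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical): the classifier misses one hateful message.
gold = ["hate", "hate", "none", "none"]
pred = ["hate", "none", "none", "none"]
print(round(macro_f1(gold, pred), 3))  # prints 0.733
```

On this scale, the reported improvement of up to 0.213 is a shift of about a fifth of the metric's full 0-to-1 range.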

130. ❌ Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Authors: Danlu Chen, Ka Sing He, Jiahe Tian, Chenghao Xiao, Zhaofeng Wu, Taylor Berg-Kirkpatrick, Freda Shi Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25222v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

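Scoring rationale: The paper studies evaluation for extremely low-resource machine translation. It proposes the FRED difficulty metrics to calibrate evaluation scores, focusing on how dataset-intrinsic properties (train-test overlap, pre-training exposure, corpus diversity, tokenization coverage) affect reported performance. It does not cover innovations in LLM internals (MoE, quantization, attention optimization), LLM training methods (pre-training, fine-tuning, alignment), LLM inference techniques (speculative decoding, context extension), LLM application paradigms (agents, tool use, RAG), or LLM applications in science. Every keyword targets LLM technology or AI-for-science applications, whereas this paper concerns evaluation methodology for traditional machine translation, so all keywords receive a relevance of 0.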

!!! tip deepseek-chat TL;DR

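To address the large variance in reported results for extremely low-resource machine translation, the paper proposes the FRED difficulty metrics, which show how strongly train-test overlap and pre-training exposure affect results, and gives cross-lingual transfer a more transparent basis for evaluation.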

Abstract (Chinese translation)

极低资源机器翻译（MT）领域的研究呈现出报告性能指标令人困惑的波动性，这常常导致不同语言对之间的结果难以被准确置于具体语境中加以理解。对于专注于特定语言群体（例如古代语言）的研究者而言，几乎无法判断其他语境（如非洲或美洲原住民语言）中报道的突破性进展，究竟是源于更优的方法论，还是仅仅源于基准数据集构建本身带来的假象。为解决此问题，我们引入了FRED难度度量指标，该指标包括：繁衍率（Fertility Ratio, F）、检索代理指标（Retrieval Proxy, R）、预训练暴露度（Pre-training Exposure, E）和语料库多样性（Corpus Diversity, D）。这些指标作为数据集内在度量标准，用于为报告的性能分数提供语境化解释。这些度量指标揭示，结果变异性的相当大部分可由训练集-测试集重叠度及预训练暴露度来解释，而非模型能力本身。此外，我们发现某些语言——特别是已消亡语言和非拉丁字母的土著语言——存在分词覆盖度差（即分词繁衍率高）的问题，这凸显了从缺乏共享词汇表的高资源语言迁移模型时存在的一个根本性局限。通过在性能分数旁同时提供这些指标指数，我们能够实现更透明的跨语言迁移评估，并为极低资源机器翻译研究社群奠定更可靠的基础。

Abstract

The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups – such as ancient languages – it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages – particularly extinct and non-Latin indigenous languages – suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.

Keywords: extremely low-resource machine translation, evaluation calibration, FRED difficulty metrics, train-test overlap, pre-training exposure, tokenization coverage, cross-lingual transfer, benchmark contextualization
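
The abstract above links poor tokenization coverage to high token fertility. The paper's exact definition of the Fertility Ratio is not given in the abstract; the sketch below assumes the common definition of fertility as the average number of subword tokens per whitespace-separated word. It uses a hypothetical greedy longest-match tokenizer (`toy_subword_tokenize`) and a made-up vocabulary in place of a real model tokenizer:

```python
def toy_subword_tokenize(word, vocab, max_len=4):
    """Greedy longest-match segmentation (illustrative stand-in for a real tokenizer).

    Characters not covered by any vocabulary entry fall back to single-character tokens.
    """
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def token_fertility(text, vocab):
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    n_tokens = sum(len(toy_subword_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


print(token_fertility("the cats", {"the", "cat", "s"}))  # prints 1.5
print(token_fertility("the cats", set()))                # prints 3.5
```

The second call mimics a language whose words are absent from the vocabulary: every word breaks into single characters and fertility rises, which is the coverage problem the abstract describes for extinct and non-Latin-script languages.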

131. ❌ Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Authors: Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin, Ziyang Luo, Bo-Wen Zhang, Xu-Cheng Yin Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

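Scoring rationale: The paper proposes the PCGR framework, which uses multimodal large language models (MLLMs) to automatically discover and validate concept nodes and builds a concept graph for reasoning, so it is relevant to 'Large Language Models' (8). The framework emphasizes structured, concept-based reasoning and produces interpretable reasoning chains, which relates to 'Chain of Thought' and 'System 2 Thinking' (8 each). The work targets misinformation detection and deals with factuality and veracity, which relates to 'Hallucination Mitigation' (8). The framework is designed to be interpretable, which corresponds directly to 'Mechanistic Interpretability' (10). Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not mentioned in the abstract and score 0.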

!!! tip deepseek-chat TL;DR

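The paper tackles multimodal misinformation detection with PCGR, an interpretable framework based on concept-graph reasoning that uses multimodal LLMs to discover concepts automatically. It reports state-of-the-art detection accuracy and robustness to emerging manipulation types.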

Abstract (Chinese translation)

多模态虚假信息正构成日益严峻的挑战，其常能规避传统检测器——这些检测器多为不透明的黑箱系统，且面对新型操纵策略时表现脆弱。本文提出概率概念图推理框架，该框架具备可解释性与可演化特性，将多模态虚假信息检测重新定义为基于概念的结构化推理任务。PCGR遵循“先构建后推断”范式：首先构建由人类可理解概念节点组成的图谱，其中包含通过多模态大语言模型自动发现并验证的新型高层概念；随后在此概念图上应用分层注意力机制以推断声明真实性。该设计产生了将证据与结论相连接的可解释推理链。实验表明，PCGR在多模态虚假信息检测准确率及对新兴操纵类型的鲁棒性方面均达到最先进水平，在粗粒度检测与细粒度操纵识别任务上均优于现有方法。

Abstract

Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

Keywords: multimodal misinformation detection, probabilistic concept graph reasoning, interpretable framework, multimodal large language models, concept-based reasoning, hierarchical attention, state-of-the-art accuracy, robustness to manipulation

132. ❌ SafeMath: Inference-time Safety improves Math Accuracy

Authors: Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

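Scoring rationale: The paper's core topic is the safety of LLMs in mathematical reasoning, and it proposes the SafeMath safety alignment technique, so it is highly relevant to 'Large Language Models' and 'Instruction Tuning/Alignment' (10 each). It involves the mathematical reasoning process, giving it partial relevance to 'Chain of Thought' (5). Its focus on reducing harmful outputs relates to 'Hallucination Mitigation' (8). Other keywords, such as MoE, SLMs, and Scaling Laws, are not covered in the paper and score 0.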

!!! tip deepseek-chat TL;DR

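The paper studies how math word problems can lead LLMs to propagate harmful content and proposes SafeMath, a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning accuracy.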

Abstract (Chinese translation)

近期研究表明，大语言模型可能通过对抗性或看似良性的输入被操纵，从而产生有害、偏见或违反政策的内容。本文探讨了一个尚未被充分研究的议题：有害且具有毒性的数学应用题。我们发现，数学问题——尤其是以自然语言叙事形式呈现的问题——可能成为传播偏见、不道德或心理有害内容的隐蔽媒介，这在涉及儿童的教育场景中风险尤为突出。为支持对此现象的系统性研究，我们提出了ToxicGSM数据集，该数据集包含1.9k个算术问题，这些问题在保持数学推理任务定义明确的同时，嵌入了有害或敏感的背景信息。基于此数据集，我们对现有大语言模型的行为进行了审计，并分析了安全约束与数学正确性之间的权衡关系。我们进一步提出SafeMath——一种安全对齐技术，能在减少有害输出的同时保持（某些情况下甚至提升）数学推理性能。研究结果凸显了将语言层面的危害与数学推理相剥离的重要性，并证明有效的安全对齐未必以牺牲准确性为代价。相关源代码与数据集已发布于https://github.com/Swagnick99/SafeMath/tree/main。

Abstract

Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath – a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.

Keywords: LLMs, safety alignment, mathematical reasoning, harmful content, ToxicGSM dataset, inference-time safety, math accuracy, educational settings

133. ❌ A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Authors: Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

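Scoring rationale: The paper focuses on cataloging Basque dialectal resources and creating datasets, covering the collection of online dialectal data and standard-to-dialect adaptation (manual and automatic). The work belongs to dialectal natural language processing (NLP) and low-resource language data engineering. It does not involve large models, deep learning internals, model training or optimization methods, inference techniques, AI agents, or AI-for-science applications. Every scoring keyword targets large models and deep learning techniques, none of which the paper addresses, so all keywords receive a relevance of 0.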

!!! tip deepseek-chat TL;DR

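To address data scarcity in Basque dialectal NLP, the paper systematically catalogs the dialectal resources available online and creates high-quality datasets through manual and automatic standard-to-dialect adaptation to support dialectal NLP research.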

Abstract (Chinese translation)

近期关于方言自然语言处理的研究指出，数据稀缺是主要制约因素。为应对这一局限，本文系统梳理了当代巴斯克方言数据与资源，全面汇编了当前可用的巴斯克方言数据。研究区分了两类数据来源：原始以方言撰写的在线数据，以及由标准语向方言转换的适配数据。前者涵盖所有可从线上获取的方言数据，例如新闻与广播网站、非正式推文，以及词典、语言地图集、语法规则或视频等在线资源。后者包括通过人工或自动方式从标准变体转换为方言变体的数据。在人工适配方面，研究将XNLI自然语言推理数据集的测试集人工适配为三种巴斯克方言：西部方言、中部方言和纳瓦拉-拉布尔丹方言，从而构建了高质量的平行黄金标准评估数据集。在自动方言适配方面，研究通过母语者对自动适配的物理常识数据集（BasPhyCowest）进行了人工评估，以检验其质量，并确定其是否可作为完全人工适配（即白银数据创建）的有效替代方案。

Abstract

Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

Keywords: Basque dialects, dialectal NLP, data scarcity, resource catalog, standard-to-dialect adaptation, XNLI dataset adaptation, gold standard evaluation, silver data creation

134. ❌ Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Authors: Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang, Min Zhang, Shimin Tao, Daimeng Wei Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

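Scoring rationale: The paper proposes Cross-Preference Learning (CPL), a training framework based on preference optimization, to improve context-aware machine translation. The method applies preference optimization directly (DPO falls under the 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' keyword) and is a post-training fine-tuning (SFT) technique for large language models (LLMs). The experiments use LLMs including Qwen3-4B, Qwen3-8B, and Llama-3-8B, so the paper is highly relevant to 'Large Language Models' and 'Post-training/SFT'. Other keywords, such as MoE, quantization, inference acceleration, and AI for science, are not covered.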

!!! tip deepseek-chat TL;DR

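The paper addresses the uneven benefit of context in context-aware machine translation with Cross-Preference Learning (CPL), a preference-optimization training framework that integrates preferences from both sentence-level and context-aware translation. Without modifying the model architecture, it improves translation quality and robustness across several LLMs.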

Abstract (Chinese translation)

上下文感知机器翻译（Context-aware MT）利用文档级信息，但其性能并不总是优于句子级机器翻译，因为上下文信号在不同句子中的益处并不均衡。现有训练目标未能显式建模这种差异性，限制了模型自适应利用上下文的能力。本文提出交叉偏好学习（Cross-Preference Learning, CPL），这是一种基于偏好的训练框架，能够显式捕捉句子级与上下文感知机器翻译的互补优势。CPL通过将内部条件偏好与跨条件偏好同时整合到偏好优化目标中实现这一目标。内部与跨条件偏好的引入为“何时以及如何利用上下文信息提升翻译质量”提供了显式监督。我们在多个公开的上下文感知机器翻译任务上使用包括Qwen3-4B、Qwen3-8B和Llama-3-8B在内的多种模型验证了所提方法。实验结果表明，该方法无需任何架构修改，即可在两种输入条件下持续提升翻译质量与鲁棒性。

Abstract

Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model’s ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.

Keywords: Cross-Preference Learning, context-aware machine translation, preference optimization, sentence-level translation, intra-condition preference, cross-condition preference, Qwen3, Llama-3
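
The abstract above refers to a "preference optimization objective" without stating the loss, and the scoring rationale associates the method with DPO. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair. It is not CPL's objective, and the function name, argument names, and example log-probabilities are all assumptions:

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model; beta scales the implicit reward.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(round(dpo_pair_loss(-5.0, -6.0, -5.0, -6.0), 4))  # prints 0.6931
# Policy prefers the chosen translation more than the reference does: lower loss.
print(dpo_pair_loss(-4.0, -7.0, -5.0, -6.0) < math.log(2))  # prints True
```

In a setting like the one the abstract describes, a pair could contrast translations produced with and without document context, but how CPL builds its intra- and cross-condition pairs is not specified in the abstract.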

135. ❌ Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Authors: Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

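Scoring rationale: The paper studies bilingual text-to-motion generation, builds its dataset with LLM-assisted annotation, and proposes a cross-lingual alignment method. It is relevant to 'Large Language Models' (8) because LLMs are used for data annotation; to 'Alignment' (8) because it proposes a cross-lingual alignment strategy; and partially to 'Scaling Laws AND Data Quality' (5) because it involves dataset construction and quality control. Other keywords, such as MoE, SLMs, RLHF, and RAG, are unrelated to the paper (0).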

!!! tip deepseek-chat TL;DR

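To address the lack of datasets and weak cross-lingual semantic understanding in bilingual text-to-motion generation, the paper introduces BiHumanML3D, the first bilingual benchmark dataset, and Bilingual Motion Diffusion, a model with cross-lingual alignment. Together they improve motion generation quality for bilingual inputs and zero-shot code-switching scenarios.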

Abstract (Chinese translation)

文本驱动动作生成在跨语言应用领域具有重要潜力，但其发展受限于双语数据集的缺乏以及现有语言模型较差的跨语言语义理解能力。为弥补这些不足，我们提出了首个双语文本驱动动作基准数据集BiHumanML3D，该数据集通过大语言模型辅助标注与严格人工校正构建而成。此外，我们设计了一种简单而有效的基线模型——双语动作扩散模型（BiMD），其核心特征为跨语言对齐机制（CLA）。该机制显式地对齐跨语言语义表征，构建出鲁棒的条件空间，使其能够从双语输入（包括零样本语码转换场景）生成高质量动作序列。大量实验表明，配备CLA的BiMD在BiHumanML3D数据集上取得了FID 0.045（对比基准0.169）和R@3 82.8%（对比基准80.8%）的优异表现，显著优于单语扩散模型及翻译基线方法，这充分验证了我们数据集的关键必要性与可靠性，同时证明了所提对齐策略在跨语言动作合成中的有效性。数据集与代码已发布于 https://wengwanjiang.github.io/BilingualT2M-page 。

Abstract

Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at https://wengwanjiang.github.io/BilingualT2M-page.

Keywords: Bilingual text-to-motion generation, Cross-lingual alignment, LLM-assisted annotation, Motion diffusion model, Zero-shot code-switching, BiHumanML3D benchmark, Semantic representation alignment
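
The abstract above reports R@3 alongside FID. The benchmark's exact retrieval protocol (candidate pool size, distance metric, feature extractor) is not given in the abstract, so the following is only a generic sketch of top-k retrieval accuracy. It assumes a square similarity matrix in which motion `i` is the ground-truth match for text `i`; the function name and the toy scores are hypothetical:

```python
def recall_at_k(similarity, k=3):
    """Fraction of texts whose matching motion ranks among the k most similar.

    similarity[i][j] is the score between text i and motion j;
    motion i is assumed to be the correct match for text i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 4x4 similarity matrix (made-up scores).
sim = [
    [0.9, 0.1, 0.2, 0.3],  # text 0: its motion ranks 1st (hit)
    [0.8, 0.1, 0.7, 0.6],  # text 1: its motion ranks 4th (miss)
    [0.1, 0.2, 0.9, 0.3],  # text 2: its motion ranks 1st (hit)
    [0.5, 0.6, 0.7, 0.1],  # text 3: its motion ranks 4th (miss)
]
print(recall_at_k(sim, k=3))  # prints 0.5
```

Higher R@3 means generated motions are easier to match back to their captions, which is why it is reported together with FID, a measure of distributional quality.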

136. ❌ Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Authors: Hieu Xuan Le, Benjamin Goh, Quy Anh Tang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

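Scoring rationale: The paper's core topic is using an LLM as a security judge to detect prompt attacks, which directly involves LLM applications and reasoning processes (CoT, System 2, Self-Reflection), and it evaluates a Mixture-of-Models (MoM) approach. Other keywords, such as SLMs, training techniques, acceleration methods, and scientific applications, do not appear in the abstract and score 0.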

!!! tip deepseek-chat TL;DR

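The paper studies whether lightweight LLMs can act as real-time security judges for detecting prompt attacks. A structured reasoning process yields effective low-latency guardrails, while a Mixture-of-Models setup brings only modest gains over a single judge.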

Abstract (Chinese translation)

提示攻击（包括越狱攻击和提示注入）对大型语言模型（LLM）系统构成了严重的安全威胁。在实际生产环境中，防护机制必须在严格的低延迟约束下缓解此类攻击，这导致了一种部署缺口：轻量级分类器和基于规则的系统难以在分布偏移下保持泛化能力，而基于高容量LLM的评判器又因速度过慢或成本过高而无法用于实时防护。本研究探讨了轻量级通用LLM是否能在实际生产约束下可靠地充当安全评判器。通过精心设计的提示和输出结构，我们引导轻量级LLM进行结构化推理，该过程包括明确的意图分解、安全信号验证、危害评估和自我反思。我们在一个精选数据集上评估了该方法，该数据集结合了来自真实聊天机器人的良性查询与通过自动化红队测试（ART）生成的对抗性提示，覆盖了多样且不断演变的攻击模式。结果表明，通用LLM（例如gemini-2.0-flash-lite-001）能够作为有效的低延迟评判器用于实时防护机制。该配置目前已作为集中式防护服务，部署于新加坡公共服务聊天机器人的生产环境中。此外，我们还评估了混合模型（MoM）设置，以探究聚合多个LLM评判器是否能比单一模型评判器更有效地提升提示攻击检测性能，仅观察到有限的改进。

Abstract

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

Keywords: Prompt Attack Detection, LLM-as-a-Judge, Mixture-of-Models, Security Guardrails, Low-latency, Structured Reasoning, Self-reflection, Adversarial Prompts
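
The abstract above evaluates a Mixture-of-Models setting that aggregates several LLM judges but does not say how their verdicts are combined. One simple possibility, shown here purely as an assumption, is a vote-share threshold over boolean verdicts. The function name and threshold are hypothetical, and no real judge API is called:

```python
def aggregate_verdicts(verdicts, threshold=0.5):
    """Flag a prompt as an attack when the share of judges voting 'attack'
    strictly exceeds the threshold (0.5 gives a strict majority vote).

    verdicts: iterable of booleans, one per judge (True = attack detected).
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    attack_share = sum(1 for v in verdicts if v) / len(verdicts)
    return attack_share > threshold


print(aggregate_verdicts([True, True, False]))   # prints True
print(aggregate_verdicts([True, False]))         # prints False (a tie is not a majority)
```

Each extra judge adds latency and cost per request, which may help explain why the modest gains reported in the abstract did not displace the single-model configuration in production.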

137. ❌ OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Authors: Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty, Saranya Darisipudi, Nirmal Punjabi, Ganesh Ramakrishnan Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
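
The pass check behind the score line can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration. How a per-keyword score is derived from weight and relevance is not stated in the report, so `weight * relevance` below is an assumption that merely agrees with the rows shown.

```python
BASE = 5.0               # constant term of the pass formula
PER_WEIGHT = 0.8         # multiplier applied to the total keyword weight
MAX_PER_KEYWORD = 15.0   # per-keyword cap from the report settings


def pass_threshold(total_weight):
    """Pass mark: 5.0 + 0.8 * total weight."""
    return BASE + PER_WEIGHT * total_weight


def paper_score(rows):
    """Sum per-keyword scores from (weight, relevance) pairs.

    ASSUMPTION: per-keyword score = weight * relevance, capped at MAX_PER_KEYWORD.
    """
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in rows)


# A few rows shaped like the table above: weight 0.0 with varying relevance.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 0.0)]
score = paper_score(rows)
threshold = pass_threshold(27.0)
passed = score >= threshold
```

Under this reading, a weight of 0.0 on every row forces a total of 0.0 regardless of relevance, which matches the 0.0 / 26.6 result reported for this paper.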

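Scoring rationale: The paper studies LLM applications in mental health, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). It proposes the oMind framework, which includes an SFT dataset and training and alignment, so it is highly relevant to 'Post-training/SFT' (10 points). It also touches on knowledge retrieval, data quality, reasoning, and LLM agents, giving it some relevance to 'Scaling Laws & Data Quality', 'Pre-training/Domain Adaptation', 'Instruction Tuning/Alignment', 'Retrieval-Augmented Generation', 'Chain of Thought Reasoning', and 'LLM Agents' (5 points each). Other keywords such as MoE, SLMs, RLHF, PEFT, and quantization are not mentioned in the abstract and are scored 0.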

!!! tip deepseek-chat TL;DR

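The paper addresses three obstacles to applying LLMs in mental health: scarce high-quality data, restricted training paradigms, and the difficulty of evaluating multi-turn dialogue. It proposes the oMind framework and dataset, and experiments show that its LLMs outperform baselines on both core capabilities and conversations.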

Abstract translation (Chinese)

大型语言模型（LLM）在复杂任务中展现出卓越能力，然而其在医学领域特别是心理健康领域的适应仍面临特定挑战。心理健康已成为全球日益关注的问题，而LLM在协助应对这一问题上具有巨大潜力。我们重点阐述了LLM在心理健康领域面临的三大主要挑战：缺乏高质量、可解释且基于知识的训练数据；训练范式局限于核心能力；以及多轮对话场景的评估。针对这些问题，我们提出了oMind框架，该框架包括针对多样化能力（包括对话）的训练与对齐LLM智能体；通过基于结构化知识检索、LLM驱动的剪枝及审核流程的生成管线，构建了包含约16.4万条数据的高质量多任务监督微调（SFT）数据集。我们还推出了oMind-Chat——一个新颖的多轮对话基准数据集，其包含专家标注的轮次级别与会话级别评估标准。我们在核心能力与对话场景上的多样化实验表明，oMind系列LLM模型持续优于基线模型。oMind-LLM还展现出显著更优的推理能力，胜率最高可达80%。

Abstract

Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation in medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally with LLMs having large potential to help address the same. We highlight three primary challenges for LLMs in mental health - lack of high quality interpretable and knowledge grounded training data; training paradigms restricted to core capabilities, and evaluation of multi turn dialogue settings. Addressing it, we present oMind framework which includes training and aligning LLM agents for diverse capabilities including conversations; high quality ~164k multi-task SFT dataset, as a result of our generation pipeline based on Structured Knowledge retrieval, LLM based pruning, and review actions. We also introduce oMind-Chat - a novel multi turn benchmark dataset with expert annotated turn level and conversation level rubrics. Our diverse experiments on both core capabilities and conversations shows oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning with up to 80% win rate.

Keywords: Large Language Models, mental health, fine-tuning, multi-turn dialogue, knowledge grounded, SFT dataset, LLM agents, reasoning

138. ❌ Approaches to Analysing Historical Newspapers Using LLMs

Authors: Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25051v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

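Scoring rationale: The paper explicitly uses LLMs (the GaMS3-12B-Instruct model) for sentiment analysis of historical newspapers, an application of large models in digital humanities, so it is highly relevant to 'Large Language Models' (10 points). The study analyses historical texts, an application of AI in the humanities, so it has some relevance to 'AI for Science' (5 points). The paper does not cover the technical details behind the other keywords (such as MoE, SFT, and RAG), so they are scored 0.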

!!! tip deepseek-chat TL;DR

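The study combines BERTopic topic modelling with LLM-based sentiment analysis of Slovene historical newspapers, reveals differences in how collective identities and political orientations appear in public discourse at the turn of the twentieth century, and shows the value of combining computational methods with critical interpretation in digital humanities research.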

Abstract translation (Chinese)

本研究对sPeriodika语料库中斯洛文尼亚历史报纸《Slovenec》与《Slovenski narod》进行了计算分析，综合运用主题建模、基于大语言模型（LLM）的方面级情感分析、实体图可视化及质性话语分析方法，以探究二十世纪之交的公共话语如何呈现集体身份、政治取向与国家归属。通过BERTopic模型，我们识别出主要的主题模式，揭示了两份报纸既存在共同关切又存在明显意识形态分歧的特征，这反映了其保守-天主教与自由-进步的不同取向。我们进一步评估了四种指令跟随型LLM在OCR质量受损的历史斯洛文尼亚语文本中进行定向情感分类的表现，最终选定适应斯洛文尼亚语的GaMS3-12B-Instruct模型为最适合大规模应用的模型，同时也记录了其重要局限，尤其是该模型在中性情感识别上的表现优于积极或消极情感。将该模型应用于全数据集规模的分析后，研究揭示了集体身份描绘方式的显著差异：某些群体主要出现在中性描述语境中，而其他群体则更常出现于评价性或冲突相关的话语中。随后，我们构建了命名实体识别（NER）图谱以探索集体身份与地点之间的关联。采用混合研究方法分析命名实体图谱，将量化网络分析与批判性话语分析相结合。研究重点聚焦于交织的历史政治身份与社会经济身份的形成与发展过程。总体而言，本研究证明了将可扩展的计算方法与批判性阐释相结合的价值，为基于噪声历史报纸数据的数字人文研究提供了有力支持。

Abstract

This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

Keywords: historical newspapers, large language models, sentiment analysis, topic modelling, digital humanities, collective identities, Slovene language, aspect-level analysis

139. ❌ Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Authors: Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25040v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

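Scoring rationale: The paper introduces Intern-S1-Pro, a trillion-parameter scientific multimodal foundation model, so it is highly relevant to 'Large Language Models/Foundation Models' (10 points). The model targets scientific domains, so it is highly relevant to 'AI for Science/Bioinformatics/Cheminformatics' (10 points). The abstract mentions 'advanced agent capabilities', so it is highly relevant to 'LLM Agents/Autonomous Agents' (10 points). The model involves pre-training and domain adaptation, so it is highly relevant to 'Pre-training/Domain Adaptation' (10 points). Reaching trillion scale implies that scaling laws and data quality matter, giving some relevance to 'Scaling Laws AND Data Quality' (5 points). The abstract mentions 'stronger reasoning', giving some relevance to 'Chain of Thought/Reasoning' (5 points). The model is trained with reinforcement learning, but the abstract does not say whether this is RLHF, RLAIF, or DPO, so that keyword is scored 0. Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract and are scored 0.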

!!! tip deepseek-chat TL;DR

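The paper presents Intern-S1-Pro, the first trillion-parameter scientific multimodal foundation model. Efficient reinforcement learning training strengthens it across general and scientific domains (such as chemistry, materials, and life sciences). It ranks in the top tier of open-source models for general capabilities and outperforms proprietary models in the depth of specialized scientific tasks.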

Abstract translation (Chinese)

我们推出Intern-S1-Pro，这是首个万亿参数规模的科学多模态基础模型。通过扩展至这一前所未有的规模，该模型在通用领域与科学领域均实现了全面增强。除了更强大的推理与图文理解能力外，其智能还通过先进的智能体能力得到提升。同时，其科学专业知识大幅扩展，能够驾驭化学、材料、生命科学和地球科学等关键科学领域的超过100项专业任务。实现如此大规模的训练得益于XTuner与LMDeploy提供的强大基础设施支持，这些工具使得在万亿参数级别进行高效强化学习（Reinforcement Learning, RL）训练成为可能，同时确保了训练与推理间严格的精度一致性。通过无缝整合这些进展，Intern-S1-Pro进一步强化了通用智能与专业智能的融合，作为一种“可专精的通才”（Specializable Generalist）运作，在通用能力方面跻身开源模型顶尖行列，并在专业科学任务的深度上超越了闭源模型。

Abstract

We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.

Keywords: trillion-parameter, scientific multimodal foundation model, agent capabilities, reinforcement learning training, chemistry, materials, life sciences, earth sciences

Authors: Tony Mason Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

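Scoring rationale: The paper studies the instruction-following behavior of large language models (LLMs), specifically how social register shapes the interaction topology of instructions across languages. This bears directly on 'Instruction Tuning/Alignment', because the paper examines how the mood of an instruction (such as the imperative) is interpreted by models as a social act and how that may affect alignment across languages. The paper runs experiments on several LLMs, so it is highly relevant to 'Large Language Models'. Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for Science, do not appear in the title or abstract and are scored 0.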

!!! tip deepseek-chat TL;DR

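The study finds that the way LLMs handle instructions depends on social register: instructions with the same semantic content show opposite interaction topologies across languages (cooperative in English, competitive in Spanish). Rewriting imperatives as declaratives sharply reduces cross-linguistic variance, which suggests that models treat instructions as social acts rather than technical specifications.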

Abstract translation (Chinese)

具有相同语义内容但相反交互拓扑结构的系统提示指令，在英语中表现为协作关系，在西班牙语中却呈现竞争关系。我们通过在四种语言和四种模型上进行的指令级消融实验表明，这种拓扑反转是由社交语域调节的：祈使语气在不同言语社群中承载不同的强制力，而基于多语言数据训练的模型已习得这些惯例。对单个指令块进行陈述性重写可将跨语言差异降低81%（p = 0.029，置换检验）。在十一个祈使句块中重写其中三个，即可使西班牙语指令拓扑从竞争性转为协作性，并对未重写的指令块产生溢出效应。这些发现表明，模型将指令作为社交行为而非技术规范进行处理：“绝不做X”是一种权威行使，其强制力具有语言依赖性；而“X：已禁用”则是可在语言间迁移的事实性描述。如果语域在推理阶段调节指令遵循行为，那么它在训练阶段很可能也起类似作用。我们将其表述为可检验的预测：以祈使语气撰写的宪法AI原则可能导致语言依赖的对齐。语料库：针对分解为56个指令块的生产系统提示，人工撰写了22条探测指令。

Abstract

System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: “NEVER do X” is an exercise of authority whose force is language-dependent, while “X: disabled” is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.

Keywords: Large Language Models, Instruction Following, Social Register, Multilingual, Imperative Mood, Alignment, Cross-linguistic Variance, Instruction Topology
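
The abstract reports its significance level from a permutation test. For readers unfamiliar with the procedure, here is a minimal sketch of an exact two-sample permutation test on the absolute difference of means. The test statistic, the two-sample design, and the toy numbers are assumptions for illustration only; the abstract does not describe the paper's actual design or data.

```python
from itertools import combinations
from statistics import mean


def permutation_test(a, b):
    """Exact two-sided permutation p-value for the difference of group means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n = len(pooled)
    extreme = 0
    total = 0
    # Enumerate every way of relabelling len(a) of the pooled values as group A.
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Toy data: two clearly separated groups of three values each.
p_value = permutation_test([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
```

With three values per group there are only 20 possible relabellings, so the smallest attainable two-sided p-value is 2/20. Small samples bound how small a permutation p-value can get.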

141. ❌ Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Shi Wang, Yanan Cao, Li Guo Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

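Scoring rationale: The paper focuses on AI-generated text detection and proposes Exons-Detect, a training-free detection method. It is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because it explicitly studies detection of LLM-generated text, an important downstream task for LLMs. The other keywords concern model architecture, training methods, inference optimization, and specific application domains, none of which the paper covers, so they are scored 0.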

!!! tip deepseek-chat TL;DR

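The paper proposes Exons-Detect, which detects AI-generated text by identifying and amplifying exonic tokens, and achieves state-of-the-art detection performance with strong robustness to adversarial attacks and varying input lengths.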

Abstract translation (Chinese)

大型语言模型的快速发展日益模糊了人类书写文本与人工智能生成文本之间的界限，引发了错误信息传播、作者身份不明确以及知识产权威胁等社会风险。这些担忧凸显了对有效可靠检测方法的迫切需求。现有的免训练方法通常通过将词元级信号聚合为全局分数来实现较强性能，但它们通常假设词元贡献均匀，导致在短文本序列或局部词元修改场景下鲁棒性不足。为应对这些局限性，我们提出了Exons-Detect——一种基于外显子感知词元重加权视角的免训练AI生成文本检测方法。该方法通过双模型设置下测量隐藏状态差异来识别并增强信息丰富的外显子词元，并基于生成的重要性加权词元序列计算可解释的翻译分数。实证评估表明，Exons-Detect实现了最先进的检测性能，并对对抗攻击和不同输入长度表现出强鲁棒性。特别是在DetectRL基准测试中，其平均AUROC指标相较于现有最强基线取得了2.2%的相对提升。

Abstract

The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.

Keywords: AI-generated text detection, Exons-Detect, training-free method, hidden-state discrepancy, token reweighting, robustness, adversarial attacks, AUROC improvement
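
The contrast the abstract draws between uniform aggregation and importance-weighted aggregation of token-level signals can be illustrated with a toy sketch. This is not the Exons-Detect algorithm. In the paper the weights come from hidden-state discrepancy in a dual-model setting, whereas here the weights and token scores are made-up inputs, and the function names are hypothetical.

```python
def uniform_score(token_scores):
    """Global score when every token contributes equally."""
    return sum(token_scores) / len(token_scores)


def weighted_score(token_scores, weights):
    """Global score when informative tokens are amplified by their weights."""
    total = sum(weights)
    if total == 0:
        # Fall back to uniform averaging when no token carries weight.
        return uniform_score(token_scores)
    return sum(s * w for s, w in zip(token_scores, weights)) / total


# Toy inputs: the middle token carries most of the signal and gets the largest weight.
token_scores = [0.2, 0.9, 0.1]
weights = [1.0, 3.0, 1.0]
plain = uniform_score(token_scores)
reweighted = weighted_score(token_scores, weights)
```

In this toy case the uniform average dilutes the single informative token, while the weighted average lets it dominate. This is the intuition behind why reweighting could help on short sequences or after localized token edits.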

142. ❌ LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Authors: Yuhang Zhou, Zhuokai Zhao, Ke Li, Spilios Evmorfos, Gökalp Demirci, Mingyi Wang, Qiao Liu, Qifei Wang, Serena Li, Weiwei Li, Tingting Wang, Mingze Gao, Gedi Zhou, Abhishek Kumar, Xiangjun Fan, Lizhu Zhang, Jiayi Liu Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24979v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

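Scoring rationale: The paper proposes the MoFA framework, which uses an LLM for reasoning-based feature selection. Its core involves LLM application, a reasoning process (CoT/System 2), and an agent framework, so it is highly relevant to 'Large Language Models', 'Chain of Thought', 'System 2 Thinking', and 'LLM Agents' (10 points each). 'Explainable AI' has some relevance because of the emphasis on interpretability (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and are scored 0.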

!!! tip deepseek-chat TL;DR

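The study proposes MoFA, a constraint-aware feature selection framework based on LLM reasoning, and evaluates it in three industrial applications, where it improved model accuracy and efficiency.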

Abstract translation (Chinese)

特征选择是大规模工业机器学习系统中的关键步骤，直接影响模型的准确性、效率与可维护性。传统特征选择方法依赖标注数据和统计启发式规则，难以适用于标注数据有限且需满足多重运营约束的生产环境。为此，我们提出模型特征代理（Model Feature Agent, MoFA），一种基于模型驱动的框架，通过结合语义与量化特征信息，执行顺序化的、基于推理的特征选择。MoFA将特征定义、重要性分数、相关性及元数据（如特征组或类型）整合至结构化提示中，并通过可解释、约束感知的推理过程进行特征选择。我们在三个实际工业应用中评估了MoFA：（1）真实兴趣与时间价值预测，该场景中MoFA在降低特征组复杂度的同时提升了准确性；（2）价值模型增强，该场景中MoFA发现了高阶交互项，在在线实验中带来显著的参与度提升；（3）通知行为预测，该场景中MoFA选择了紧凑且高价值的特征子集，同时提高了模型准确性与推理效率。这些结果共同证明了基于大语言模型（LLM）的推理方法在实际生产系统中进行特征选择的实用性与有效性。

Abstract

Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

Keywords: feature selection, LLM reasoning, constraint-aware, industrial systems, Model Feature Agent, interpretable reasoning, production environments, sequential reasoning
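
The abstract says MoFA packs feature definitions, importance scores, correlations, and metadata into structured prompts. A rough sketch of what such a prompt builder might look like follows. The field names, prompt layout, example features, and the constraint text are all hypothetical, since the abstract does not give the actual prompt format.

```python
def build_feature_prompt(features, constraints):
    """Render candidate features and operational constraints as one prompt string."""
    lines = ["Candidate features:"]
    for f in features:
        lines.append(
            f"- {f['name']} | definition: {f['definition']} | "
            f"importance: {f['importance']:.2f} | "
            f"max correlation: {f['max_corr']:.2f} | group: {f['group']}"
        )
    lines.append("Constraints:")
    lines.extend(f"- {c}" for c in constraints)
    lines.append("Select features step by step and justify each choice.")
    return "\n".join(lines)


# Hypothetical features; none of these names come from the paper.
features = [
    {"name": "clicks_7d", "definition": "clicks in the last 7 days",
     "importance": 0.42, "max_corr": 0.91, "group": "engagement"},
    {"name": "clicks_30d", "definition": "clicks in the last 30 days",
     "importance": 0.40, "max_corr": 0.91, "group": "engagement"},
    {"name": "account_age", "definition": "days since signup",
     "importance": 0.15, "max_corr": 0.10, "group": "profile"},
]
prompt = build_feature_prompt(features, ["keep at most one feature per group"])
```

Putting the quantitative signals next to each definition gives the model what it needs to weigh a statistically strong but redundant feature against a weaker but independent one.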

143. ❌ Can MLLMs Read Students’ Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Authors: Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24961v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

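Scoring rationale: The paper studies multimodal large language models (MLLMs) applied to mathematics education, an application of large models in science and education. Highly relevant keywords: (1) 'Large Language Models' (the paper explicitly studies MLLMs); (2) 'AI for Science' (education is treated as a scientific application domain). Moderately relevant: 'Chain of Thought' and 'System 2 Thinking' (logical reasoning and error diagnosis); 'Explainable AI' (the error explanation task). The remaining keywords (such as MoE, quantization, and alignment) are not covered in the paper and are scored 0.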

!!! tip deepseek-chat TL;DR

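The paper tackles error diagnosis in students' handwritten math scratchwork. It introduces the ScratchMath benchmark, evaluates 16 MLLMs, and finds significant gaps between current models and human experts in visual recognition and logical reasoning.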

Abstract translation (Chinese)

评估学生手写演算过程对于个性化教育反馈至关重要，但由于笔迹多样、布局复杂及解题方法各异，这一任务面临独特挑战。现有教育自然语言处理（NLP）研究主要关注文本回答，忽视了真实手写演算中固有的复杂性与多模态特性。当前多模态大语言模型（MLLMs）虽擅长视觉推理，但通常采用“应试者视角”，优先追求生成正确答案而非诊断学生错误。为弥补这些不足，我们提出了ScratchMath——一个专门为解释和分类真实手写数学演算错误而设计的新型基准。我们的数据集包含来自中国中小学生1,720份数学手写样本，支持两项核心任务：错误原因解释（ECE）与错误原因分类（ECC），并定义了七种错误类型。该数据集通过严谨的人机协同方法进行精细标注，涵盖多阶段专家标记、审查与验证流程。我们在ScratchMath上系统评估了16个主流多模态大语言模型，发现其与人类专家相比存在显著性能差距，尤其在视觉识别与逻辑推理方面。专有模型明显优于开源模型，其中大型推理模型在错误解释任务中展现出强大潜力。所有评估数据与框架均已公开，以促进进一步研究。

Abstract

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an “examinee perspective”, prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

Keywords: multimodal large language models, handwritten mathematics, error analysis, educational AI, benchmark evaluation, visual reasoning, logical reasoning, student assessment

144. ❌ Toward domain-specific machine translation and quality estimation systems

Author: Javad Pourmostafa Roshan Sharami Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24955v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

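Scoring rationale: The dissertation studies domain adaptation for machine translation and quality estimation. Its core contributions are data selection, a domain-adaptation training pipeline, an analysis of subword tokenization, and QE-guided in-context learning. Relevance to the keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" scores 10 because domain adaptation of MT and QE systems is the central topic; 2) "Post-training OR Supervised Fine-tuning OR SFT" scores 8 because the work covers fine-tuning (for example, the Chapter 4 study of subword tokenization and vocabulary in fine-tuning); 3) "Large Language Models OR LLMs OR Foundation Models" scores 8 because Chapter 5 explicitly applies in-context learning to large language models; 4) "In-context Learning OR Many-shot Learning" scores 10 because Chapter 5 proposes a QE-guided in-context learning method. The remaining keywords, such as MoE, SLMs, Scaling Laws, RAG, and RLHF, have no direct connection to the work and score 0.

!!! tip deepseek-chat TL;DR

This dissertation studies how data selection, domain-adaptation training, aligned subword tokenization, and QE-guided in-context learning can make machine translation and quality estimation systems more reliable in specialized domains.

Abstract (Chinese translation)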

机器翻译（Machine Translation，MT）与质量估计（Quality Estimation，QE）在通用领域表现良好，但在领域不匹配时性能下降。本论文研究如何通过一系列以数据为中心的贡献，使MT与QE系统适应专业领域。第二章提出一种基于相似性的数据选择方法用于机器翻译。小型、有针对性的领域内数据子集能够超越规模大得多的通用数据集，并以较低计算成本实现较强的翻译质量。第三章介绍一种分阶段的QE训练流程，将领域适应与轻量级数据增强相结合。该方法提升了跨领域、跨语言及不同资源设置（包括零样本和跨语言场景）下的性能。第四章研究子词分词与词汇表在微调中的作用。对齐的分词-词汇表设置可带来稳定的训练和更好的翻译质量，而不匹配的配置则会降低性能。第五章提出一种基于QE指导的上下文学习方法用于大语言模型。QE模型选择能提升翻译质量的示例，且无需参数更新，其表现优于标准检索方法。该方法还支持无参考设置，降低了对单一参考集的依赖。这些结果表明，领域适应依赖于数据选择、表示方法以及高效的适应策略。本论文为构建在特定领域设置中可靠运行的MT与QE系统提供了方法。

Abstract

Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.

Keywords: Machine Translation, Quality Estimation, Domain Adaptation, Data Selection, In-context Learning, Fine-tuning, Subword Tokenization, Domain-specific Systems

145. ❌ Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Authors: Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

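Scoring rationale: The paper addresses the inference efficiency of Vision-Language-Action (VLA) models in robotic manipulation and proposes TIES, a dynamic token selection method based on inter-layer ranking consistency. Although it involves vision-language-action models, its core is the efficient handling of visual tokens rather than innovation in large language models (LLMs) or in the principles of deep learning techniques. All scoring keywords target specific LLM techniques, training methods, application scenarios, or evaluation metrics, whereas this paper focuses on visual token selection and robot policy performance, so none of them is directly relevant.

!!! tip deepseek-chat TL;DR

To address the inference latency that Vision-Language-Action models incur from processing dense visual tokens, the paper proposes TIES, a dynamic selection framework guided by inter-layer token ranking consistency. On the CogACT + SIMPLER benchmark it raises the average success rate by 6% while cutting token usage by 78%.

Abstract (Chinese translation)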

视觉-语言-动作（Vision-Language-Action, VLA）模型在机器人操作任务中表现出色，但由于处理密集的视觉标记（tokens）而存在显著的推理延迟。现有的标记缩减方法主要依赖注意力大小作为静态选择依据。本研究挑战了这一假设，揭示了高注意力标记具有任务依赖性，甚至可能降低策略性能。为解决此问题，我们提出了 TIES（Tau引导的Inter-layer Efficient Selection，即 Tau 引导的层间高效选择框架），这是一个由层间标记排序一致性引导的动态框架。通过自适应地平衡注意力大小与排序一致性，TIES 能够在无需额外训练的情况下实现鲁棒的标记选择。在 CogACT + SIMPLER 基准测试中，TIES 将平均成功率提升了 6%，同时减少了 78% 的标记使用量，并在不同解码器和基准测试中展现出强大的泛化能力。

Abstract

Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.

Keywords: Vision-Language-Action models, token reduction, attention magnitude, inter-layer ranking consistency, robotic manipulation, inference latency, TIES framework, policy performance
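
The abstract does not spell out how TIES measures inter-layer ranking consistency. As a rough illustration only, the sketch below assumes the "Tau" in the name refers to Kendall's tau rank correlation and computes it between the per-token attention scores of two layers. The function name `kendall_tau` and the toy scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two equal-length score lists.

    Returns a value in [-1, 1]: 1 means both lists rank the tokens in the
    same order, -1 means the order is reversed. Tied pairs count as
    neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two tokens")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical attention scores for four visual tokens at two layers.
layer_3 = [0.40, 0.30, 0.20, 0.10]
layer_4 = [0.35, 0.10, 0.25, 0.30]
tau = kendall_tau(layer_3, layer_4)
```

In a scheme of this kind, a tau near 1 would suggest that the token ranking is stable across layers, while a low tau would suggest leaning less on attention magnitude alone. How TIES actually combines the two signals is not described in the abstract.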

146. ❌ LogitScope: A Framework for Analyzing LLM Uncertainty Through Information Metrics

Authors: Farhan Ahmed, Yuya Jeremy Ong, Chad DeLuca Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

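Scoring rationale: The paper's core contribution is LogitScope, a framework that uses information metrics (such as entropy and varentropy) to analyze token-level uncertainty during LLM generation and thereby identify hallucinations and high-uncertainty decision points. It is therefore highly relevant to "Large Language Models" (10 points), since the paper is specifically about an uncertainty analysis framework for LLMs. It is partly relevant to "Hallucination Mitigation" and "Mechanistic Interpretability" (5 points each), because the framework aims to identify potential hallucinations and supports model behavior analysis, which aids interpretability. Other keywords, such as MoE, SFT, and RAG, concern specific techniques or application areas that the paper does not cover and score 0.

!!! tip deepseek-chat TL;DR

To address the limited insight that traditional evaluation gives into uncertainty in LLM outputs, the paper proposes LogitScope, a framework that analyzes model confidence patterns through token-level information metrics and identifies potential hallucinations and high-uncertainty decision points without labeled data.

Abstract (Chinese translation)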

理解和量化大型语言模型（LLM）输出中的不确定性对于其可靠部署至关重要。然而，传统评估方法在生成过程中对模型在单个词元（token）位置上的置信度提供的信息有限。为解决这一问题，我们提出了LogitScope，这是一个轻量级框架，通过基于概率分布计算的词元级信息指标来分析LLM的不确定性。通过在每个生成步骤测量熵（entropy）和变熵（varentropy）等指标，LogitScope能够揭示模型置信度的模式，识别潜在的幻觉（hallucinations），并暴露模型表现出高度不确定性的决策点，整个过程无需标注数据或语义解释。我们展示了LogitScope在不确定性量化、模型行为分析和生产监控等多种应用中的实用性。该框架与模型无关，通过惰性评估（lazy evaluation）实现计算高效，且兼容任何HuggingFace模型，使研究者和实践者能够在推理过程中检查LLM的行为。

Abstract

Understanding and quantifying uncertainty in large language model (LLM) outputs is critical for reliable deployment. However, traditional evaluation approaches provide limited insight into model confidence at individual token positions during generation. To address this issue, we introduce LogitScope, a lightweight framework for analyzing LLM uncertainty through token-level information metrics computed from probability distributions. By measuring metrics such as entropy and varentropy at each generation step, LogitScope reveals patterns in model confidence, identifies potential hallucinations, and exposes decision points where models exhibit high uncertainty, all without requiring labeled data or semantic interpretation. We demonstrate LogitScope’s utility across diverse applications including uncertainty quantification, model behavior analysis, and production monitoring. The framework is model-agnostic, computationally efficient through lazy evaluation, and compatible with any HuggingFace model, enabling both researchers and practitioners to inspect LLM behavior during inference.

Keywords: LLM uncertainty, token-level analysis, information metrics, entropy, hallucination detection, model confidence, inference monitoring, probability distributions
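
The two metrics named in the abstract can be computed from a single next-token distribution. The sketch below is a minimal illustration, not LogitScope's API: the function name `entropy_and_varentropy` and the toy logits are hypothetical. It takes entropy as the expected surprisal and varentropy as the variance of the surprisal, both in nats.

```python
import math


def entropy_and_varentropy(logits):
    """Return (entropy, varentropy) in nats for one next-token distribution."""
    # Numerically stable softmax over the raw logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Surprisal -log p of each token; zero-probability tokens contribute nothing.
    pairs = [(p, -math.log(p)) for p in probs if p > 0.0]
    entropy = sum(p * s for p, s in pairs)
    varentropy = sum(p * (s - entropy) ** 2 for p, s in pairs)
    return entropy, varentropy


# Toy logits (hypothetical): a confident step and a maximally uncertain step.
confident = entropy_and_varentropy([8.0, 0.0, 0.0, 0.0])
uncertain = entropy_and_varentropy([1.0, 1.0, 1.0, 1.0])
```

A uniform distribution gives the maximum entropy (log of the vocabulary size) and zero varentropy, whereas a sharply peaked one gives entropy close to zero. Applying the function at each generation step yields the kind of per-token trace the abstract describes.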

147. ❌ GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

Authors: Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

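Scoring rationale: The paper focuses on improving retrieval-augmented generation (RAG) systems and proposes GraphER, a graph-based enrichment and reranking method. The method aims to raise retrieval quality by capturing multiple forms of proximity beyond semantic similarity, particularly for complex information needs. Its core contribution is optimizing the retrieval component of RAG systems, which has no direct connection to most techniques in the keyword list (LLM training, alignment, reasoning, compression, scientific applications, and so on). The only highly relevant keyword is "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation", because the paper directly targets the retrieval bottleneck of RAG systems as its central research question. The other keywords concern foundation-model principles, training methods, inference optimization, agent systems, or domain-specific applications, none of which the paper discusses.

!!! tip deepseek-chat TL;DR

Semantic search in RAG systems is often insufficient for complex information needs. The paper proposes GraphER, a graph-based enrichment and reranking method that enriches data objects during offline indexing and reranks candidates over a graph at query time. Experiments on multiple retrieval benchmarks show it is effective, and it remains compatible with standard vector stores.

Abstract (Chinese translation)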

检索增强生成（RAG）系统中的语义搜索往往难以满足复杂的信息需求，尤其是在相关证据分散于多个来源的情况下。针对此问题的现有方法包括智能体检索策略，其通过生成额外查询来扩展语义搜索空间。然而，这些方法未能充分利用数据的组织结构，而是依赖于迭代式探索，可能导致检索效率低下。另一类方法采用知识图谱，通过图边建模非语义关系。尽管这类方法能有效捕捉更丰富的邻近关系，但其维护成本高昂，且通常与多数生产系统中使用的向量数据库不兼容。为克服这些局限，我们提出了GraphER——一种基于图的增强与重排序方法，能够捕捉超越语义相似性的多种邻近关系。GraphER在离线索引阶段独立增强数据对象，并在查询时对候选对象执行基于图的重排序。该设计无需依赖知识图谱，使得GraphER能够无缝集成于标准向量数据库。此外，GraphER与检索器无关，且引入的延迟开销可忽略不计。在多个检索基准测试上的实验验证了所提方法的有效性。

Abstract

Semantic search in retrieval-augmented generation (RAG) systems is often insufficient for complex information needs, particularly when relevant evidence is scattered across multiple sources. Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries. However, these methods do not fully leverage the organizational structure of the data and instead rely on iterative exploration, which can lead to inefficient retrieval. Another class of approaches employs knowledge graphs to model non-semantic relationships through graph edges. Although effective in capturing richer proximities, such methods incur significant maintenance costs and are often incompatible with the vector stores used in most production systems. To address these limitations, we propose GraphER, a graph-based enrichment and reranking method that captures multiple forms of proximity beyond semantic similarity. GraphER independently enriches data objects during offline indexing and performs graph-based reranking over candidate objects at query time. This design does not require a knowledge graph, allowing GraphER to integrate seamlessly with standard vector stores. In addition, GraphER is retriever-agnostic and introduces negligible latency overhead. Experiments on multiple retrieval benchmarks demonstrate the effectiveness of the proposed approach.

Keywords: Retrieval-Augmented Generation, RAG, Graph-based enrichment, Reranking, Semantic search, Vector stores, Retrieval benchmarks, Proximity capture

148. ❌ Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

Authors: A. Feder Cooper, Mark A. Lemley, Christopher De Sa, Lea Duesterwald, Allison Casasola, Jamie Hayes, Katherine Lee, Daniel E. Ho, Percy Liang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

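Scoring rationale: The paper's core topic is a method for estimating memorization extraction risk in LLMs, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), as the whole work revolves around privacy and copyright risks of LLMs. Other keywords, covering MoE, SLMs, training methods, reasoning techniques, agent systems, and scientific applications, are not mentioned in the title or abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes decoding-constrained beam search, a method for efficiently estimating near-verbatim extraction risk in language models. It addresses two limitations of existing methods: high cost and the omission of near-verbatim instances.

Abstract (Chinese translation)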

近期研究表明，用于量化大语言模型记忆现象的标准贪婪解码提取方法，未能捕捉提取风险在不同序列间的动态变化。概率性提取方法——即在特定解码方案下计算给定前缀生成目标后缀的概率——虽能解决此问题，但仅适用于逐字记忆的量化，忽略了具有相似隐私与版权风险的近似逐字记忆实例。量化近似逐字提取风险的成本高昂：近似逐字后缀集合在组合层面极为庞大，可靠的蒙特卡洛估计可能需要对每个序列进行约10万次采样。为降低这一成本，我们提出解码约束束搜索方法，该方法能以每个序列约20次蒙特卡洛采样的计算代价，为近似逐字提取风险提供确定性下界。实验表明，我们的方法揭示了逐字检测方法无法观测的信息：包括更多可提取序列、显著更高的单序列提取质量，以及近似逐字提取风险随模型规模和文本类型变化的规律模式。

Abstract

Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction – computing the probability of generating a target suffix given a prefix under a decoding scheme – addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.

Keywords: language models, memorization, extraction risk, near-verbatim extraction, decoding-constrained beam search, privacy, copyright, Monte Carlo estimation
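
The probabilistic extraction quantity described in the abstract, the probability of generating a target suffix given a prefix, factorizes by the chain rule into per-token conditional probabilities. The sketch below shows only this verbatim case, assuming plain sampling from the model distribution; it does not implement the paper's decoding-constrained beam search. The function name `suffix_probability` is hypothetical, and the per-token log-probabilities are made-up inputs that would in practice come from a model's forward pass.

```python
import math


def suffix_probability(token_logprobs):
    """Probability of emitting an exact suffix.

    token_logprobs[i] is log p(token_i | prefix, earlier suffix tokens).
    """
    # Summing in log space avoids underflow for long suffixes.
    return math.exp(sum(token_logprobs))


# Hypothetical three-token suffix with conditional probabilities 0.9, 0.5, 0.8.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
p_verbatim = suffix_probability(logprobs)
```

A near-verbatim risk estimate would instead have to account for every suffix within some distance of the target, which is the combinatorially large set the abstract refers to.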

149. ❌ LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis

Authors: Baraa Hikal, Jonas Becker, Bela Gipp Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

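Scoring rationale: The paper focuses on multitask learning for sentiment analysis, using uncertainty weighting to balance different regression objectives, which is a specific application area within natural language processing. It does not involve large models, innovation in deep learning principles, AI applications in science, or any of the technical concepts in the scoring keywords (LLMs, MoE, Scaling Laws, fine-tuning methods, reasoning techniques, model optimization, and so on), and is unrelated to all of them.

!!! tip deepseek-chat TL;DR

The paper studies how uncertainty-weighted multitask learning can automatically balance valence and arousal prediction, whose relative difficulty differs across languages, in dimensional aspect-based sentiment analysis. The system placed first on five datasets in SemEval-2026 Task 3.

Abstract (Chinese translation)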

本文介绍了我们为SemEval-2026任务3：维度方面级情感分析（DimABSA）所开发的LogSigma系统。与预测离散情感标签的传统方面级情感分析不同，DimABSA要求预测1-9范围内的连续效价与唤醒度得分。一个核心挑战在于，效价与唤醒度在不同语言和领域中的预测难度存在差异。我们通过引入学习型同方差不确定性来解决这一问题，该模型通过学习任务特定的对数方差参数，在训练过程中自动平衡每个回归目标。结合语言专用编码器和多种子集成方法，LogSigma在两个赛道共五个数据集上均取得了第一名。由于不同语言的效价-唤醒度难度分布存在差异——从德语的0.66倍到英语的2.18倍——学习得到的方差权重在不同语言间差异显著，这表明最优的任务平衡具有语言依赖性，无法通过先验知识确定。

Abstract

This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles (from 0.66x for German to 2.18x for English), demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.

Keywords: Dimensional Aspect-Based Sentiment Analysis, Valence-Arousal prediction, Multitask Learning, Uncertainty Weighting, Homoscedastic Uncertainty, Language-specific Encoders, Multi-seed Ensembling, SemEval Competition
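
The abstract describes learned homoscedastic uncertainty with task-specific log-variance parameters but does not give the loss. A common formulation of this idea, assumed here, weights each task loss L_t by exp(-s_t) and adds s_t as a regularizer, where s_t is the learned log-variance. The sketch below uses fixed toy losses and plain gradient descent on s_t. The function names, the exact loss form, and the numbers are illustrative assumptions, not the LogSigma implementation.

```python
import math


def weighted_loss(task_losses, log_vars):
    """Sum over tasks of exp(-s_t) * L_t + s_t, with s_t a log-variance."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


def log_var_gradients(task_losses, log_vars):
    """Gradient of weighted_loss with respect to each s_t."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(task_losses, log_vars)]


# Toy setup (hypothetical values): a small valence loss and a large arousal loss.
losses = [0.25, 4.0]
log_vars = [0.0, 0.0]
for _ in range(2000):
    grads = log_var_gradients(losses, log_vars)
    log_vars = [s - 0.1 * g for s, g in zip(log_vars, grads)]

# Effective weight applied to each task loss after the log-variances settle.
task_weights = [math.exp(-s) for s in log_vars]
```

With the task losses held fixed, each s_t settles at log L_t, so the harder task receives the smaller weight. In actual training the losses change as the shared model is updated, and the log-variances are learned jointly with it.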

150. ❌ How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

Authors: Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

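Scoring rationale: The paper evaluates the physical generative reasoning ability of vision-language models (VLMs), not large language models (LLMs) or deep learning principles as such. Most keywords (LLMs, MoE, Scaling Laws, the various training methods, inference acceleration, and so on) are therefore unrelated (0 points). The relevant keywords are: "Chain of Thought" and "System 2 Thinking" (5 points each), because the paper evaluates multi-step planning and in-depth reasoning; "Self-Correction" (8 points), because the benchmark supports iterative interaction and model self-correction; and "LLM Agents" (8 points), because the benchmark is designed to evaluate how agents plan and act in an interactive environment. Other keywords such as "AI for Science" are not directly relevant: although the paper touches on an engineering domain, its core is an evaluation method rather than an application to scientific discovery.

!!! tip deepseek-chat TL;DR

The paper introduces DreamHouse, a new benchmark that evaluates the physical generative reasoning of vision-language models in residential timber-frame construction. It finds substantial capability gaps in current state-of-the-art models and highlights physical validity as a key evaluation axis orthogonal to visual realism.

Abstract (Chinese translation)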

物理世界不仅关乎视觉呈现，更受严谨的结构与程序性约束所支配。然而，当前对视觉-语言模型（VLMs）的评估仍严重偏向感知真实性，优先关注生成视觉上合理的3D布局、形状与外观。现有基准测试很少检验模型是否真正掌握了实际构建这些实体所需的分步流程与物理依赖关系，而这种能力对于实现设计到建造流程的自动化至关重要。为此，我们提出了DreamHouse——一个用于物理生成推理的新型基准测试：该能力旨在合成同时满足几何、结构、可建造性与规范符合性约束的实体。我们将此基准建立在住宅木结构建造领域，该领域拥有完全成文化的工程标准与可客观验证的正确性。我们收集了涵盖13种建筑风格的超过26,000个结构，每个均按施工图标准（LOD 350）验证，并开发了一个确定性的10项结构验证框架。与仅评估最终输出的静态基准不同，DreamHouse支持迭代式智能体交互：模型可观察中间建造状态、生成施工动作，并接收结构化环境反馈，从而实现对规划、结构推理与自我修正能力的细粒度评估。通过对前沿视觉-语言模型的大量实验，我们发现了现有排行榜难以揭示的显著能力差距。这些发现确立了物理有效性作为与视觉真实性正交的关键评估维度，并凸显物理生成推理作为多模态智能中一个独立且尚未充分发展的前沿领域。项目地址：https://luluyuyuyang.github.io/dreamhouse

Abstract

The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse

Keywords: vision-language models, physical generative reasoning, benchmark, structural validation, agentic interaction, self-correction, construction, DreamHouse

151. ❌ AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Authors: Zhenyi Wang, Siyu Luan Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24857v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

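Scoring rationale: The paper focuses on a taxonomy of security threats for foundation models, so it is highly relevant to the keyword 'Large Language Models OR LLMs OR Foundation Models' (10 points): it explicitly takes foundation models as its setting and proposes a unified classification of security threats. The other keywords mainly concern specific LLM techniques, training methods, inference optimization, and application domains; as a survey in the security field, this paper does not address those technical details, so their relevance is 0.

!!! tip deepseek-chat TL;DR

To address the lack of a unified framework for machine learning security threats in the foundation model era, the paper proposes a four-class threat taxonomy based on bidirectional data-model interaction, giving a systematic perspective for analyzing and defending against foundation model security risks.

Abstract translation (Chinese)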

随着机器学习（ML）系统在规模和功能上的不断扩展，安全态势日益复杂，攻击与防御手段层出不穷。然而，现有研究大多孤立地看待这些威胁，缺乏一个能够揭示其共同原理与相互依赖关系的统一框架。这种碎片化的视角阻碍了系统性的理解，也限制了对全面防御方案的设计。关键在于，机器学习的两个核心资产——数据和模型——不再相互独立；一方的漏洞会直接危及另一方。由于缺乏整体性框架，关于这些双向风险如何在机器学习流程中传播的问题仍未得到解答。为填补这一关键空白，我们提出了一种统一闭环威胁分类法，该分类法沿着四个方向轴明确地构建了模型与数据之间的交互关系。我们的框架为分析和防御基础模型提供了一个原则性的视角。由此产生的四类安全威胁代表了不同但相互关联的攻击类别：（1）数据→数据（D→D）：包括数据解密攻击与水印去除攻击；（2）数据→模型（D→M）：包括投毒攻击、有害微调攻击与越狱攻击；（3）模型→数据（M→D）：包括模型反演攻击、成员推理攻击与训练数据提取攻击；（4）模型→模型（M→M）：包括模型提取攻击。我们的统一框架阐明了这些安全威胁之间的内在联系，并为开发可扩展、可迁移、跨模态的安全策略奠定了基础，尤其适用于基础模型领域。

摘要 (Abstract)

As machine learning (ML) systems expand in both scale and functionality, the security landscape has become increasingly complex, with a proliferation of attacks and defenses. However, existing studies largely treat these threats in isolation, lacking a coherent framework to expose their shared principles and interdependencies. This fragmented view hinders systematic understanding and limits the design of comprehensive defenses. Crucially, the two foundational assets of ML, data and models, are no longer independent; vulnerabilities in one directly compromise the other. The absence of a holistic framework leaves open questions about how these bidirectional risks propagate across the ML pipeline. To address this critical gap, we propose a unified closed-loop threat taxonomy that explicitly frames model-data interactions along four directional axes. Our framework offers a principled lens for analyzing and defending foundation models. The resulting four classes of security threats represent distinct but interrelated categories of attacks: (1) Data→Data (D→D): including data decryption attacks and watermark removal attacks; (2) Data→Model (D→M): including poisoning, harmful fine-tuning attacks, and jailbreak attacks; (3) Model→Data (M→D): including model inversion, membership inference attacks, and training data extraction attacks; (4) Model→Model (M→M): including model extraction attacks. Our unified framework elucidates the underlying connections among these security threats and establishes a foundation for developing scalable, transferable, and cross-modal security strategies, particularly within the landscape of foundation models.

Keywords: AI Security, Foundation Models, Threat Taxonomy, Model-Data Interaction, Security Threats, Unified Framework, Machine Learning Security, Closed-loop Threat Analysis
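
The four directional classes listed in the abstract can be pictured as a small lookup table keyed by (source asset, target asset). The sketch below is only an illustration of that structure: the names `Asset`, `TAXONOMY`, and `classify` are hypothetical and do not come from the paper, and the attack lists contain only the examples named in the abstract.

```python
from enum import Enum


class Asset(Enum):
    """The two ML assets the abstract treats as attack source and target."""
    DATA = "D"
    MODEL = "M"


# Hypothetical encoding of the four directional classes; the attack names
# are the examples given in the abstract, nothing more.
TAXONOMY = {
    (Asset.DATA, Asset.DATA): ["data decryption", "watermark removal"],
    (Asset.DATA, Asset.MODEL): ["poisoning", "harmful fine-tuning", "jailbreak"],
    (Asset.MODEL, Asset.DATA): [
        "model inversion",
        "membership inference",
        "training data extraction",
    ],
    (Asset.MODEL, Asset.MODEL): ["model extraction"],
}


def classify(attack: str) -> str:
    """Return the directional class label (e.g. 'D->M') for a listed attack."""
    for (source, target), attacks in TAXONOMY.items():
        if attack in attacks:
            return f"{source.value}->{target.value}"
    raise KeyError(f"attack not listed in the abstract: {attack}")


if __name__ == "__main__":
    for name in ("jailbreak", "membership inference", "model extraction"):
        print(name, "=>", classify(name))
```

Keying the table on the (source, target) pair makes the closed-loop reading explicit: every class is a directed edge between the two assets.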

152. ❌ Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Authors: Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

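Scoring rationale: The paper's core topic is the distributional reasoning ability of language models (LLMs), improving post-training with a multi-answer reinforcement learning (RL) method. It is rated highly relevant (10 points) to 'Large Language Models', 'Post-training', and 'RLHF' because it explicitly studies RL methods for LLM post-training; it is partly relevant to 'AI for Science' (5 points) because the method is evaluated in scientific application settings such as medical diagnosis; other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a multi-answer reinforcement learning method that lets a language model generate multiple candidate answers in a single forward pass, improving answer diversity, coverage, and calibration scores on question answering, medical diagnosis, and coding tasks while requiring fewer tokens to generate multiple answers.

Abstract translation (Chinese)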

给定一个问题，语言模型（LM）隐式编码了可能答案的分布。在实践中，语言模型的后训练过程通常会将此分布坍缩至单一主导模式。尽管这对于假设存在单一正确答案的基准式评估通常不构成问题，但许多现实任务本质上涉及多个有效答案或不可约的不确定性。例如医疗诊断、模糊问题解答以及信息不完整的场景。在这些情况下，我们希望语言模型能够生成多个合理假设，最好能附带每个假设的置信度估计，且无需通过计算密集的重复采样来生成非模态答案。本文描述了一种多答案强化学习方法，用于训练语言模型在推理过程中对多个答案进行分布推理。我们修改了强化学习目标，使模型能够在单次前向传播中显式生成多个候选答案，将推理时搜索的某些方面内化到模型的生成过程中。在问答、医疗诊断和代码生成基准测试中，与单答案训练的基线相比，我们观察到多样性、覆盖率和集合层面的校准分数均得到提升。采用我们方法训练的模型生成多个答案所需的标记数少于竞争方法。在代码任务中，其准确性也显著更高。这些结果表明，多答案强化学习可作为一种原则性强且计算高效的替代方案，以取代推理时的扩展过程（如k最佳采样）。代码及更多信息请访问 https://multi-answer-rl.github.io/。

摘要 (Abstract)

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.

Keywords: language models, distributional reasoning, reinforcement learning, multi-answer generation, post-training, medical diagnosis, inference efficiency, calibration
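
The abstract reports diversity and coverage for sets of answers but does not define the metrics. As a rough sketch under assumed definitions (not the paper's): coverage is taken here as the fraction of questions whose answer set contains the gold answer, and diversity as the fraction of distinct answers within one set. The function names and example data are hypothetical.

```python
def coverage(answer_sets, gold_answers):
    """Fraction of questions whose answer set contains the gold answer.

    Assumed definition for illustration; the paper may define it differently.
    """
    hits = sum(1 for answers, gold in zip(answer_sets, gold_answers) if gold in answers)
    return hits / len(gold_answers)


def diversity(answers):
    """Fraction of distinct answers within a single answer set (assumed definition)."""
    return len(set(answers)) / len(answers)


if __name__ == "__main__":
    # Hypothetical model outputs for two questions.
    answer_sets = [["flu", "cold", "flu"], ["asthma"]]
    gold_answers = ["cold", "bronchitis"]
    print("coverage:", coverage(answer_sets, gold_answers))
    print("diversity of first set:", diversity(answer_sets[0]))
```

A model whose distribution has collapsed onto a single mode would score low on both measures: repeated answers lower diversity, and a missed non-modal gold answer lowers coverage.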

153. ❌ Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Authors: Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24826v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

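Scoring rationale: The paper's core topic is continued pretraining of large language models (LLMs) on Portuguese, using a 7B instruction-tuned model to generate synthetic data, so it is highly relevant to 'Large Language Models' and 'Pre-training' (10 points). It examines how data quality interacts with model scale, which relates to 'Scaling Laws AND Data Quality' (8 points). Its use of an instruction-tuned model to generate data gives it some connection to 'Instruction Tuning' (5 points). Other keywords such as MoE, SFT, RAG, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The study examines how synthetic rewriting acts as a data-quality multiplier in Portuguese continued pretraining: at the 7B scale, rewriting high-quality data yields a +3.4 NPM gain, rewriting low-quality data yields only +0.5 NPM, and the effect depends on model scale.

Abstract translation (Chinese)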

通过文档重写生成合成数据已成为改进语言模型预训练的一项前景广阔的技术，但多数研究集中于英语领域，且未能系统控制待重写源数据的质量。本文以葡萄牙语持续预训练为背景，对合成重写与源数据质量间的交互关系展开对照研究。基于已标注STEM与教育质量评分的葡萄牙语语料库ClassiCC-PT，我们构建了两个不同质量等级的百亿词元子集，并采用70亿参数的指令微调模型将每个子集重写为四种风格，每种条件下生成约400亿词元的合成数据。我们在每种条件下训练两个以英语为中心的基座模型（11亿和70亿参数），并在涵盖44项任务的葡萄牙语综合基准测试集PoETa V2上进行评估。在70亿参数规模下，对高质量数据进行重写相比未修改的同等数据可获得+3.4 NPM（标准化百分比提升）的增益，而对低质量数据进行重写仅产生+0.5 NPM增益。在11亿参数规模下，这种交互作用较弱，未修改的低质量数据与重写后的高质量数据表现相当。我们的研究结果表明，合成重写主要发挥质量倍增器的作用，而非数据筛选的替代方案，且这种效应具有规模依赖性。

摘要 (Abstract)

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

Keywords: synthetic data generation, continued pretraining, Portuguese language models, data quality, instruction-tuned model, scale-dependent effects, document rewriting, language model evaluation

154. ❌ Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Authors: Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

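Scoring rationale: The paper's core topic is applying RLVR (Reinforcement Learning with Verifiable Rewards) to LLM reasoning; it proposes ARRoL, which prunes rollouts online to speed up training and improve accuracy. It is highly relevant to 'Large Language Models' (the paper explicitly uses LLMs such as Qwen-3 and LLaMA-3.2), rated 10 points. It has some connection to 'Speculative Decoding OR Inference Acceleration' (the paper targets training speedup and inference efficiency), rated 5 points. Other keywords such as MoE, SFT, RAG, and CoT are not covered in the paper, rated 0.

!!! tip deepseek-chat TL;DR

To address the high training cost and weak learning signals of RLVR methods for large language models, the paper proposes ARRoL, an online rollout pruning method that achieves up to 1.7x training speedup and average accuracy gains of +2.30 to +2.99 on Qwen-3 and LLaMA-3.2 models.

Abstract translation (Chinese)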

可验证奖励强化学习（RLVR）显著提升了大语言模型（LLMs）的推理能力。然而，诸如GRPO和DAPO等方法存在高昂的计算成本，因为它们依赖于为每个提示采样大量轨迹展开。此外，在RLVR中相对优势往往是稀疏的：许多样本变得几乎全对或全错，导致组内奖励方差低，从而产生微弱的学习信号。本文提出arrol（通过在线轨迹剪枝加速RLVR），这是一种在线轨迹剪枝方法，能在生成过程中剪除轨迹，同时明确引导存留的轨迹向更均衡的正确性分布，以增强学习信号。具体而言，arrol动态训练一个轻量级质量预测头来预估部分轨迹展开的成功概率，并利用其做出早期剪枝决策。习得的质量预测头可进一步在测试时扩展中加权候选答案以提高推理准确性。为提升效率，我们提出一种系统设计，在推理引擎内部进行轨迹剪枝，并对剩余轨迹重新批处理以计算对数概率并更新策略。在Qwen-3和LLaMA-3.2模型（1B-8B）上对GRPO和DAPO的实验中，arrol将平均准确率提升了+2.30至+2.99，同时实现了最高1.7倍的训练加速，并在测试时扩展中带来最高+8.33的额外平均准确率增益。代码发布于https://github.com/Hsu1023/ARRoL。

摘要 (Abstract)

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.

Keywords: Reinforcement Learning with Verifiable Rewards, RLVR, Large Language Models, rollout pruning, training acceleration, inference efficiency, Qwen-3, LLaMA-3.2
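
The abstract's point about weak learning signals can be seen in a group-relative advantage of the kind GRPO-style methods use: each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is a minimal illustration, not ARRoL itself; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    eps and the population standard deviation are illustrative choices;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # All rollouts correct: zero within-group variance, so every advantage is 0.
    print(group_advantages([1, 1, 1, 1]))
    # A correctness-balanced group gives clearly non-zero advantages.
    print(group_advantages([1, 0, 1, 0]))
```

When a group is all-correct or all-incorrect, every advantage vanishes and the prompt contributes no gradient, which is the motivation the abstract gives for steering surviving rollouts toward a more balanced mix.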

155. ❌ Enhancing Structured Meaning Representations with Aspect Classification

Authors: Claire Benét Post, Paul Bontempo, August Milliken, Alvin Po-Chun Chen, Nicholas Derby, Saksham Khatwani, Sumeyye Nabieva, Karthik Sairam, Alexis Palmer Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

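Scoring rationale: The paper focuses on aspect classification in semantic representation and natural language processing, covering semantic graph annotation, dataset creation, and baseline model experiments; it belongs to traditional NLP semantic analysis. All scored keywords concern large models, deep learning techniques, or AI for Science applications, none of which this paper addresses: it does not use LLMs or SLMs, does not discuss pre-training, fine-tuning, or alignment, does not involve inference acceleration, hallucination mitigation, model compression, or other large-model techniques, and is not applied to scientific domains such as bioinformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the sparse annotation of aspect in meaning representations, the paper creates a dataset of English sentences annotated with UMR aspect labels and establishes initial benchmarks for automatic UMR aspect prediction using three modeling approaches.

Abstract translation (Chinese)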

为完整捕捉句子的意义，语义表征应编码体貌信息——即描述事件内部时间结构的特征。在基于图结构的语义表征框架（如统一语义表征UMR）中，体貌信息能揭示事件随时间展开的方式，包括状态、活动与已完成事件等区别性特征。尽管体貌至关重要，但在各类语义表征框架中其标注仍显稀疏。这不仅制约了当前人工标注实践，也阻碍了能够预测体貌信息的自动化系统的发展。本文引入一个新型英语句子数据集，该数据集在缺乏体貌特征的抽象语义表征（AMR）图基础上标注了UMR体貌标签。我们详细阐述了依据UMR体貌层级结构对事件性谓词进行标注的方案与准则，并介绍了通过多阶段裁定流程确保标注者间一致性与质量的标注流程。为证明本数据集对未来自动化研究的实用价值，我们采用三种建模方法进行了基线实验。实验结果建立了自动UMR体貌预测的初步基准，为更广泛地将体貌信息整合进语义表征体系奠定了基础。

摘要 (Abstract)

To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.

Keywords: aspect classification, semantic representations, Uniform Meaning Representations, dataset annotation, Abstract Meaning Representation, eventive predicates, baseline experiments, automatic prediction

156. ❌ Fine-Tuning A Large Language Model for Systematic Review Screening

Authors: Kweku Yamoah, Noah Schroeder, Emmanuel Dorley, Neha Rani, Caleb Schutz Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

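Scoring rationale: The paper's core work is supervised fine-tuning (SFT) of a small 1.2B-parameter large language model (LLM) for literature screening in systematic reviews, which falls under AI for Science. It is therefore highly relevant to 'Large Language Models' and 'Post-training/SFT' (10 points), relevant to 'AI for Science' (8 points), and partly relevant to 'Small Language Models' (5 points, because it uses a 1.2B-parameter model but does not emphasize its small size); other keywords such as MoE, Scaling Laws, and RLHF are not covered and score 0.

!!! tip deepseek-chat TL;DR

By fine-tuning a 1.2B-parameter large language model, the study substantially improves its performance on title and abstract screening for systematic reviews, reaching 86.40% agreement with the human coder.

Abstract translation (Chinese)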

传统系统综述的完成通常需要耗费大量人力与时间，部分原因在于必须对海量标题和摘要进行潜在纳入审查。近期，研究者开始探索如何利用大语言模型提升这一过程的效率。然而，迄今的研究成果显示出不一致的结论。我们认为，这仅依靠提示工程可能无法为模型提供足够的上下文以保障其良好性能。在本研究中，我们针对一项人工评估了超过8500条标题与摘要的系统综述，专门对一个12亿参数的开源权重大语言模型进行了研究筛选任务的微调。结果显示，微调后的模型性能显著提升，加权F1分数较基础模型提高了80.79%。在全部8277篇文献的数据集上，微调模型与人工编码者的一致性达到86.40%，其真阳性率为91.18%，真阴性率为86.38%，且在多次推理运行中表现完全一致。综上所述，我们的研究表明，通过微调大语言模型用于大规模系统综述的标题与摘要筛选具有广阔的应用前景。

摘要 (Abstract)

Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, a 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.

Keywords: large language models, fine-tuning, systematic review, study screening, title and abstract screening, supervised fine-tuning, AI for science, biomedical informatics
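
The agreement, true positive rate, and true negative rate quoted in the abstract are all derived from one confusion matrix. The sketch below shows the arithmetic. The four counts are illustrative assumptions, not figures reported in the abstract: they were chosen only so that they sum to 8,277 and round to the reported percentages. The function name is hypothetical.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute agreement, true positive rate, and true negative rate."""
    total = tp + fn + tn + fp
    return {
        "tpr": tp / (tp + fn),            # share of human-included studies the model included
        "tnr": tn / (tn + fp),            # share of human-excluded studies the model excluded
        "agreement": (tp + tn) / total,   # share of all studies where model and human agree
    }


if __name__ == "__main__":
    # Hypothetical counts (NOT from the paper), picked to be consistent
    # with the rounded rates in the abstract.
    metrics = screening_metrics(tp=31, fn=3, tn=7120, fp=1123)
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

Under counts like these, overall agreement sits very close to the true negative rate because excluded studies vastly outnumber included ones, so the two rates are worth reading separately.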

157. ❌ Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Authors: Mohammed Nowshad Ruhani Chowdhury, Mohammed Nowaz Rabbani Chowdhury, Sakari Lukkarinen Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24772v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

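Scoring rationale: The paper's core work is fine-tuning the LLaMA 3.1-8B large language model for medical transcription, an application of large models in the biomedical domain. Highly relevant keywords are 'Large Language Models' (the paper uses LLaMA 3.1-8B), 'Post-training/Supervised Fine-tuning' (fine-tuning is the core method), and 'AI for Science/Bioinformatics' (medical transcription is treated as a bioinformatics application). 'Pre-training/Domain Adaptation' scores 5 because the paper mentions 'domain-aligned' but does not describe pre-training or domain adaptation in detail. Other keywords such as MoE, SLMs, RLHF, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

By fine-tuning the LLaMA 3.1-8B large language model, the study develops a domain-specific model for Finnish medical transcription that achieves high semantic similarity to reference transcripts on a small validated dataset, demonstrating the feasibility of fine-tuning for clinical documentation in low-resource languages.

Abstract translation (Chinese)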

临床文档记录是影响患者安全、诊断与护理连续性的关键因素。电子健康记录（EHR）带来的行政负担是导致医生职业倦怠的重要因素。对于包括芬兰语在内的低资源语言而言，这是一个至关重要的问题。本研究旨在探讨领域对齐的自然语言处理（NLP）大语言模型在芬兰语医疗转录中的有效性，方法是在梅特罗波利亚应用科学大学学生提供的少量经过验证的模拟临床对话语料上对LLaMA 3.1-8B模型进行微调。医疗转录的微调过程采用了受控的预处理与优化方法，并通过七折交叉验证评估微调效果。微调后LLaMA 3.1-8B模型的评估指标为BLEU = 0.1214、ROUGE-L = 0.4982、BERTScore F1 = 0.8230。结果显示，模型输出与参考转录文本的n元语法重叠度较低，但语义相似性很强。本研究表明，微调可以是处理芬兰语口语医疗对话的有效途径，并支持了为芬兰语临床文档开发以隐私为导向的领域专用大语言模型的可行性。此外，研究也为未来工作提供了方向。

摘要 (Abstract)

Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study aims to investigate the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.

Keywords: medical transcription, large language model, fine-tuning, Finnish language, clinical documentation, low-resource languages, LLaMA 3.1-8B, domain-specific NLP
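
ROUGE-L, one of the metrics reported above, scores a candidate against a reference by their longest common subsequence (LCS) of tokens. The sketch below computes a balanced ROUGE-L F1 with whitespace tokenization. It is a simplified illustration: the paper's actual tokenizer and F-measure weighting are not stated in the abstract, and the function names and example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Balanced ROUGE-L F1 over whitespace tokens (assumed simplification)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical English example; the study itself uses Finnish transcripts.
    print(rouge_l_f1("the patient has a fever", "patient has high fever"))
```

Because the LCS tolerates gaps, ROUGE-L can stay moderate when exact n-gram overlap (as measured by BLEU) is low, which matches the pattern the abstract describes.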

158. ❌ SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Authors: Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究大模型驱动的编码代理在长期迭代任务中的性能退化问题，与LLM Agents高度相关（10分），因为这是核心研究对象；与Chain of Thought、System 2 Thinking、Self-Correction、Tool Use有一定关联（5分），因为这些是代理能力的一部分；其他关键词如MoE、量化、RAG等未涉及（0分）。

!!! tip deepseek-chat TL;DR

论文研究了编码代理在长期迭代任务中代码质量会持续退化的问题，发现当前代理缺乏迭代软件开发所需的设计纪律，其代码比人类代码更冗长且结构更易侵蚀。

Abstract (Chinese translation)

软件开发本质上是迭代的，然而当前主流的智能体编码基准测试大多针对完整需求规格评估单次生成的解决方案。代码或许能通过测试套件，但会变得越来越难以扩展。近期的迭代基准测试试图弥补这一差距，但过度限制了智能体的设计决策，无法真实衡量代码质量如何影响后续扩展。我们提出了SlopCodeBench，一个与编程语言无关的基准测试集，包含20个问题和93个检查点。在该框架中，智能体需根据不断演化的需求规格反复扩展其先前解决方案，这些规格强制要求架构决策但不规定内部实现结构。我们追踪两个轨迹级别的质量指标：冗余度（即冗余或重复代码的比例）和结构侵蚀度（即高复杂度函数所集中的复杂性质量占比）。在测试的11个模型中，没有任何智能体能端到端完整解决任何一个问题；最高检查点解决率仅为17.2%。代码质量持续恶化：80%的轨迹中出现结构侵蚀加剧，89.8%的轨迹中冗余度上升。与48个开源Python代码库相比，智能体生成代码的冗余度为其2.2倍，且结构侵蚀显著更严重。对其中20个代码库的历时追踪显示：人类代码质量保持稳定，而智能体代码随每次迭代持续劣化。提示干预研究表明，初始质量虽可提升，但无法阻止劣化趋势。这些结果证明，仅以通过率为标准的基准测试系统性地低估了代码扩展鲁棒性，且当前智能体缺乏迭代软件开发所要求的设计纪律。

Abstract

Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent’s design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.

Keywords: coding agents, iterative tasks, benchmark, code quality degradation, verbosity, structural erosion, LLM agents, software development
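
The two trajectory-level signals defined in the abstract can be illustrated with a minimal Python sketch. This is not SlopCodeBench's implementation: the function names, the per-function complexity inputs, and the `threshold` of 10 that marks a function as high-complexity are all assumptions made for illustration, and the abstract does not say how redundant lines are detected.

```python
def structural_erosion(complexities, threshold=10):
    """Share of total complexity carried by functions above `threshold`.

    `complexities` holds one complexity score per function (for example,
    cyclomatic complexity). The threshold value is a hypothetical choice.
    """
    total = sum(complexities)
    if total == 0:
        return 0.0
    heavy = sum(c for c in complexities if c > threshold)
    return heavy / total


def verbosity(total_lines, redundant_lines):
    """Fraction of lines flagged as redundant or duplicated.

    How lines get flagged is outside this sketch; the counts are inputs.
    """
    if total_lines == 0:
        return 0.0
    return redundant_lines / total_lines


# Hypothetical trajectory: per-function complexities at three checkpoints.
# One function keeps absorbing new logic, so erosion rises.
trajectory = [[3, 4, 5], [3, 4, 12], [3, 4, 25]]
erosion_per_checkpoint = [structural_erosion(c) for c in trajectory]
```

Tracking such values per checkpoint, rather than once at the end, is what makes the signal trajectory-level: the interesting quantity is whether the series rises as the agent extends its own code.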

159. ❌ Demystifying When Pruning Works via Representation Hierarchies

Authors: Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

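Scoring rationale: The paper studies how network pruning affects language models, which falls under LLM efficiency optimization and interpretability. Relevant keywords: (1) "Large Language Models" (8): the paper explicitly studies pruning of language models; (2) "Quantization OR Model Compression" (8): pruning is a major model-compression technique; (3) "Mechanistic Interpretability" (8): the pruning mechanism is analyzed through representation hierarchies; (4) "PEFT" (5): pruning is counted as a parameter-efficiency technique; (5) "Mixture of Experts" and "Small Language Models" (5 each): related to sparse and lightweight models; (6) "Speculative Decoding" (5): indirectly related to inference efficiency. Other keywords, such as alignment, RAG, and AI for Science, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

Using a representation-hierarchy analysis, the paper explains why network pruning of language models hurts generative tasks but works for non-generative ones: the nonlinear logit-to-probability transformation amplifies pruning-induced deviations, which degrades generation performance.

Abstract (Chinese translation)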

网络剪枝通过移除重要性较低的参数或架构，常被期望在保持性能的同时提升效率。然而，这一预期在语言任务中并不总是一致的：剪枝后的模型在非生成任务上可能表现良好，但在生成式场景中却常常失效。为理解这种差异，我们从表征层级的角度分析了网络剪枝，将语言模型的内部计算分解为三个连续空间：嵌入空间（隐藏表征）、逻辑空间（softmax前输出）和概率空间（softmax后分布）。我们发现，嵌入空间和逻辑空间中的表征对剪枝引起的扰动具有较强鲁棒性。然而，从逻辑值到概率的非线性变换放大了这些偏差，这些偏差在时间步上累积，导致生成过程中性能显著下降。相比之下，类别-标记概率子空间的稳定性，连同嵌入空间的鲁棒性，共同支撑了剪枝在检索和多项选择等非生成任务中的有效性。我们的分析揭示了剪枝在不同任务中的差异化影响，并为其实际应用提供了指导。代码发布于 https://github.com/CASE-Lab-UMD/Pruning-on-Representations

Abstract

Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations

Keywords: network pruning, language models, representation hierarchy, generative tasks, non-generative tasks, model efficiency, robustness analysis, parameter reduction
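
The abstract's claim that the logit-to-probability step can amplify small deviations can be seen in a toy Python example. This is a hand-picked illustration, not the paper's analysis: the logit values and the perturbation standing in for pruning error are hypothetical, and softmax does not amplify every perturbation (it does so here because two candidates are close).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relative_deviation(original, perturbed):
    """Euclidean norm of the change divided by the norm of the original."""
    diff = math.sqrt(sum((p - o) ** 2 for o, p in zip(original, perturbed)))
    return diff / math.sqrt(sum(o ** 2 for o in original))


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits with two close competitors, and a slightly
# perturbed copy standing in for the output of a pruned model.
logits = [10.0, 9.5, 0.0]
pruned_logits = [9.7, 9.8, 0.0]

logit_dev = relative_deviation(logits, pruned_logits)
prob_dev = relative_deviation(softmax(logits), softmax(pruned_logits))
token_flipped = argmax(logits) != argmax(pruned_logits)

print(f"relative deviation in logit space:       {logit_dev:.3f}")
print(f"relative deviation in probability space: {prob_dev:.3f}")
print(f"greedy token changed: {token_flipped}")
```

In this case a shift of a few percent in logit space becomes a much larger shift in probability space and changes the greedy token. In autoregressive generation a changed token is fed back as input, which is consistent with the abstract's statement that deviations accumulate across time steps.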

160. ❌ When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews

Authors: Hasindri Watawana, Sergio Burdisso, Diego A. Moreno-Galván, Fernando Sánchez-Vega, A. Pastor López-Monroy, Petr Motlicek, Esaú Villatoro-Tello Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

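Scoring rationale: The paper studies model bias in automatic depression detection, mainly the application of language models to clinical interviews and interpretability analysis. It is unrelated to most keywords (LLM technical principles, training methods, inference optimization, and so on). It is partially relevant to "Mechanistic Interpretability OR Explainable AI" (5), because the paper focuses on localizing and explaining the evidence behind model decisions, and to "AI for Science OR Bioinformatics OR Cheminformatics" (5), because it applies AI to mental health and biomedicine. No other keyword is directly connected.

!!! tip deepseek-chat TL;DR

The paper finds that, in semi-structured clinical interviews, automatic depression-detection models systematically exploit script artifacts in the interviewer's prompts rather than the participant's genuine linguistic cues. This inflates reported performance and reveals an architecture-agnostic bias that holds across datasets.

Abstract (Chinese translation)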

基于公开语料库的可用性及语言建模技术的进步，从医患对话中自动检测抑郁症的研究已获得显著发展。然而，其可解释性仍存在局限：现有研究常报告优异的性能表现，却未揭示驱动预测的具体因素。本研究分析了三个数据集：ANDROIDS、DAIC-WOZ与E-DAIC，发现半结构化访谈中访谈者引导语存在系统性偏差。基于访谈者话轮训练的模型会利用固定引导语和位置来区分抑郁患者与对照组受试者，即使未使用参与者语言也常获得高分类分数。将模型输入限制为参与者话语后，决策依据分布更为广泛，并能反映真实的语言线索。尽管半结构化访谈协议确保了评估一致性，但包含访谈者引导语会通过利用脚本人为特征而虚增性能表现。我们的结果揭示了一种跨数据集、与模型架构无关的偏差，强调需要通过时间和说话者定位决策依据的分析方法，以确保模型真正从参与者的语言中学习。

Abstract

Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets: ANDROIDS, DAIC-WOZ, E-DAIC and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants’ language.

Keywords: automatic depression detection, semi-structured clinical interviews, interviewer bias, language modeling, interpretability, decision evidence localization, cross-dataset bias, participant language cues

161. ❌ Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Authors: Yixing Lao, Xuyang Bai, Xiaoyang Wu, Nuoyuan Yan, Zixin Luo, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Shiwei Li, Hengshuang Zhao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25745v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

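Scoring rationale: The paper focuses on 3D Gaussian Splatting in computer vision and graphics, proposing a feed-forward framework called LGTM to address scalability in high-resolution synthesis. It covers 3D reconstruction, novel view synthesis, and rendering optimization, but does not involve large language models, deep learning technical principles, AI for Science, or any of the specific techniques named in the scoring keywords (MoE, RLHF, RAG, and so on). All keywords are unrelated to the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes LGTM, a feed-forward framework that predicts compact Gaussian primitives with per-primitive textures. This addresses the quadratic growth in primitive count that existing feed-forward 3D Gaussian Splatting methods suffer as resolution increases, and enables high-fidelity 4K novel view synthesis without per-scene optimization.

Abstract (Chinese translation)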

现有前馈式三维高斯泼溅方法预测像素对齐的图元，导致图元数量随分辨率增加呈二次增长。这从根本上限制了其可扩展性，使得高分辨率合成（如4K）难以实现。我们提出LGTM（更少高斯，更多纹理），一种克服此分辨率缩放障碍的前馈框架。通过预测紧凑的高斯图元并结合逐图元纹理，LGTM将几何复杂度与渲染分辨率解耦。该方法无需逐场景优化即可实现高保真度的4K新视角合成，这是前馈方法此前无法达到的能力，同时使用的三维高斯图元数量显著减少。项目页面：https://yxlao.github.io/lgtm/

Abstract

Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/

Keywords: 3D Gaussian Splatting, feed-forward framework, 4K novel view synthesis, primitive prediction, texture decoupling, rendering resolution, high-fidelity synthesis, scalability improvement

162. ❌ ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

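Scoring rationale: ShotStream focuses on multi-shot video generation, which belongs to computer vision and video generation rather than core research on large language models or deep learning principles. The paper fine-tunes a text-to-video model, which is partially relevant to "Post-training OR Supervised Fine-tuning OR SFT", hence a score of 5. The other keywords center on large language models, reasoning, alignment, compression, AI for Science, and similar topics, which have no direct connection to the paper's focus on video generation and interactive storytelling, so they all score 0.

!!! tip deepseek-chat TL;DR

ShotStream proposes a novel causal multi-shot video generation architecture. Using bidirectional-to-causal distillation and a dual-cache memory mechanism, it addresses the latency and consistency challenges of multi-shot video generation for interactive storytelling and achieves coherent real-time generation.

Abstract (Chinese translation)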

多镜头视频生成对于长叙事故事讲述至关重要，然而当前的双向架构存在交互性有限和延迟较高的问题。我们提出了ShotStream，一种新颖的因果多镜头架构，能够实现交互式故事讲述和高效即时帧生成。通过将该任务重新定义为基于历史上下文条件的下一镜头生成，ShotStream允许用户通过流式提示动态指导正在进行的叙事。我们首先将文本到视频模型微调为双向下一镜头生成器，然后通过分布匹配蒸馏将其提炼为因果学生模型，以此实现这一目标。为了克服自回归生成固有的镜头间一致性和错误累积挑战，我们引入了两项关键创新。首先，一种双缓存记忆机制保持了视觉连贯性：全局上下文缓存保留条件帧以确保镜头间一致性，而局部上下文缓存则存储当前镜头内生成的帧以确保镜头内一致性。同时，我们采用RoPE不连续性指示器来明确区分这两个缓存，以消除歧义。其次，为了减轻错误累积，我们提出了一种两阶段蒸馏策略。该策略从基于真实历史镜头的镜头内自强制开始，逐步扩展到使用自生成历史记录的镜头间自强制，从而有效弥合训练与测试之间的差距。大量实验表明，ShotStream能够以亚秒级延迟生成连贯的多镜头视频，在单GPU上达到16 FPS。其质量与较慢的双向模型相当或更优，为实时交互式故事讲述铺平了道路。训练和推理代码以及模型已在我们的项目页面公开。

Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. And a RoPE discontinuity indicator is employed to explicitly distinguish the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.

Keywords: multi-shot video generation, interactive storytelling, causal architecture, streaming generation, distribution matching distillation, dual-cache memory, autoregressive generation, real-time video generation

163. ❌ MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Authors: Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

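Scoring rationale: The paper is a computer vision work on MuRF, a multi-scale inference method for Vision Foundation Models (VFMs). All of the given keywords target large language models (LLMs) or general large-model techniques and have no direct connection to the paper's vision-model research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the fact that Vision Foundation Models are typically restricted to a single scale at inference time. It proposes MuRF, a training-free multi-resolution fusion method that processes an image at multiple resolutions and fuses the resulting features to improve visual representations, and validates it across a range of vision tasks and models.

Abstract (Chinese translation)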

视觉基础模型已成为现代计算机视觉的基石，为广泛的任务提供了鲁棒的表征。尽管近期进展允许这些模型在训练期间处理不同输入尺寸，但推理过程通常仍局限于单一固定尺度。这种普遍的单尺度范式忽视了视觉感知的一个基本特性：不同分辨率能提供互补的归纳偏置——低分辨率视图擅长全局语义识别，而高分辨率视图则对细粒度精细化至关重要。本研究提出多分辨率融合，这是一种在推理阶段利用这种协同效应的简单而普遍有效的策略。该方法不依赖单一视图，而是通过冻结的视觉基础模型处理多分辨率图像并融合所得特征，从而构建统一表征。多分辨率融合的普适性是其最引人注目的属性：它不绑定于特定架构，而是作为一种根本性的、无需训练的可视表征增强手段。我们通过将多分辨率融合应用于多个不同视觉基础模型家族（以DINOv2为主，同时成功推广至SigLIP2等对比模型）所涵盖的广泛关键计算机视觉任务，对此进行了实证验证。

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.

Keywords: Vision Foundation Models, Multi-Resolution Fusion, MuRF, Computer Vision, Inference Enhancement, Feature Fusion, DINOv2, SigLIP2
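
The inference-time flow the abstract describes (run one image through a frozen encoder at several resolutions, then fuse the features) can be sketched in plain Python. Everything concrete here is an assumption for illustration: `toy_encoder` is a hypothetical stand-in for a frozen VFM, average pooling stands in for image resizing, and concatenation is only one possible fusion rule, since the abstract does not specify how MuRF fuses features.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]


def downsample(image: Image, factor: int) -> Image:
    """Average-pool by an integer factor (a stand-in for real resizing)."""
    height, width = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ) / factor ** 2
            for j in range(width)
        ]
        for i in range(height)
    ]


def toy_encoder(image: Image) -> List[float]:
    """Hypothetical frozen encoder: returns [mean, max] of the pixels."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def multi_resolution_features(
    image: Image,
    encoder: Callable[[Image], List[float]],
    factors: Sequence[int] = (1, 2),
) -> List[float]:
    """Encode each resolution with the same encoder and concatenate."""
    fused: List[float] = []
    for factor in factors:
        view = image if factor == 1 else downsample(image, factor)
        fused.extend(encoder(view))
    return fused


# A 4x4 image with a single bright pixel.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
fused = multi_resolution_features(image, toy_encoder)
```

The encoder is never updated, which mirrors the training-free property claimed in the abstract: only the number of forward passes and the size of the fused feature change.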

164. ❌ RefAlign: Representation Alignment for Reference-to-Video Generation

Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

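Scoring rationale: The paper studies representation alignment for video generation, using a visual foundation model (VFM) and a diffusion Transformer (DiT); it belongs to computer vision and generative AI. It is unrelated to the vast majority of keywords (LLM, MoE, Scaling Laws, RLHF, and so on), which mainly target language models and large-model techniques. The only partially relevant keyword is "Instruction Tuning OR Alignment OR Value Alignment", because the paper uses the notion of alignment; however, this is visual feature alignment rather than language-model alignment, hence a score of 5. No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes RefAlign, a framework that explicitly aligns the reference-branch features of a diffusion Transformer to the semantic space of a visual foundation model. This addresses copy-paste artifacts and multi-subject confusion in reference-to-video generation, and RefAlign outperforms existing methods in TotalScore on the OpenS2V-Eval benchmark.

Abstract (Chinese translation)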

参考到视频（R2V）生成是一种可控的视频合成范式，它通过文本提示和参考图像共同约束生成过程，可应用于个性化广告和虚拟试穿等场景。实践中，现有的R2V方法通常在参考图像的变分自编码器（VAE）潜在表示之外，额外引入高层语义或跨模态特征，并将其共同输入扩散Transformer（DiT）。这些辅助表示提供了语义指导并作为隐式对齐信号，能够部分缓解VAE潜在空间中的像素级信息泄漏问题。然而，它们仍难以完全解决由异构编码器特征间的模态失配所导致的复制-粘贴伪影及多主体混淆问题。本文提出RefAlign，一种表示对齐框架，其显式地将DiT参考分支特征对齐到视觉基础模型（VFM）的语义空间。RefAlign的核心是一个参考对齐损失，该损失拉近同一主体的参考特征与VFM特征以提升身份一致性，同时推远不同主体的对应特征以增强语义区分度。这一简洁而有效的策略仅在训练阶段应用，不引入推理时开销，并在文本可控性与参考保真度之间实现了更好的平衡。在OpenS2V-Eval基准上的大量实验表明，RefAlign在综合评分（TotalScore）上优于当前最先进方法，验证了显式参考对齐在R2V任务中的有效性。

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy–paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.

Keywords: Reference-to-video generation, Representation alignment, Diffusion Transformer, Visual foundation model, Identity consistency, Semantic discriminability, OpenS2V-Eval benchmark
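
The pull/push behavior the abstract attributes to the reference alignment loss can be sketched as a simple cosine-similarity objective. The exact loss used by RefAlign is not given in the abstract, so the form below (a pull term on matched subjects plus a hinge-style push term on mismatched ones) and the `margin` parameter are assumptions for illustration. Feature vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reference_alignment_loss(
    ref_feats: List[Vector],
    vfm_feats: List[Vector],
    margin: float = 0.0,
) -> float:
    """Hypothetical pull/push loss over per-subject feature vectors.

    ref_feats[i] and vfm_feats[i] describe the same subject i.
    Pull: raise the similarity of matched pairs (i, i).
    Push: penalize similarity above `margin` for mismatched pairs (i, j).
    """
    n = len(ref_feats)
    pull = sum(1.0 - cosine(ref_feats[i], vfm_feats[i]) for i in range(n)) / n
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    if not pairs:
        return pull
    push = sum(
        max(0.0, cosine(ref_feats[i], vfm_feats[j]) - margin) for i, j in pairs
    ) / len(pairs)
    return pull + push


# Two subjects with orthogonal features.
refs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = reference_alignment_loss(refs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = reference_alignment_loss(refs, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lowest when each reference feature matches its own subject's VFM feature and is dissimilar from every other subject's, which is the identity-consistency and discriminability trade-off the abstract describes. Because such a term only shapes training, it adds nothing at inference time, in line with the abstract.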

165. ❌ PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Authors: Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

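Scoring rationale: PSDesigner is an automated graphic design system that emulates the creative workflow of human designers. Its core contributions are CreativePSD, a dataset of high-quality PSD design files annotated with operation traces, and a system that autonomously infers and executes tool calls to manipulate design files. It is highly relevant to "LLM Agents / Autonomous Agents / Agentic Workflow" (10), because the system is essentially an agent that carries out design tasks autonomously, and to "Tool Use / Function Calling / API Tool Use" (10), because its core capability is learning and executing operations in professional design tools. It is fairly relevant to "Large Language Models / LLMs / Foundation Models" (8): the abstract suggests the system may be built on MLLMs (multimodal large language models) but does not state the specific model type. Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not mentioned in the abstract and are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper tackles the difficulty automated graphic design systems have in accurately turning user intent into editable design files. It proposes PSDesigner, a system that emulates the workflow of human designers, and builds the CreativePSD dataset. The system outperforms existing methods across diverse design tasks and lets non-specialists conveniently create production-quality designs.

Abstract (Chinese translation)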

平面设计是一种具有创造性与创新性的过程，在电子商务和广告等应用场景中发挥着关键作用。然而，开发一个能够将用户意图忠实转化为可编辑设计文件的自动化设计系统，仍然是一个开放的挑战。尽管近期研究借助强大的文本到图像模型和多模态大语言模型来辅助平面设计，但这些方法通常简化了专业工作流程，导致灵活性和直观性有限。为应对这些局限，我们提出了PSDesigner，一个模拟人类设计师创作流程的自动化平面设计系统。基于多个专用组件，PSDesigner能够根据用户指令收集主题相关素材，并自主推断和执行工具调用来操作设计文件，例如整合新素材或优化欠佳元素。为使系统具备强大的工具使用能力，我们构建了一个设计数据集CreativePSD，其中包含大量高质量PSD设计文件，这些文件标注了涵盖广泛设计场景与艺术风格的操作轨迹，使模型能够学习专家设计流程。大量实验表明，PSDesigner在多样化的平面设计任务中均优于现有方法，使非专业人士能够便捷地创作出具备生产质量的设计作品。

Abstract

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

Keywords: Automated Graphic Design, Human-Like Creative Workflow, Tool Use, Design Dataset, PSD Files, Operation Traces, Multi-modal LLMs, Design Agents

166. ❌ MegaFlow: Zero-Shot Large Displacement Optical Flow

Authors: Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

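Scoring rationale: MegaFlow targets optical flow estimation in computer vision, using pre-trained Vision Transformer features for global matching; it can be regarded as an application of AI in science and engineering. It is unrelated to most of the large language model keywords (LLMs, MoE, RLHF, RAG, and so on). Only two keywords are relevant: 1. "Pre-training OR Continual Pre-training OR Domain Adaptation" (8 points): the paper explicitly states that it "adapts powerful pre-trained vision priors" by "leveraging pre-trained global Vision Transformer features", so its core is adapting pre-trained vision models to a new domain. 2. "AI for Science OR Bioinformatics OR Cheminformatics" (5 points): optical flow estimation is basic computer vision research and can be viewed as AI applied to science and engineering, though not to a specific scientific field such as bioinformatics or cheminformatics. No other keyword applies.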
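The pass mark of 26.6 in the score line follows from the report's stated formula, 5.0 + 0.8 × total weight, with a total weight of 27.0. The sketch below reproduces that arithmetic. The per-row rule `weight * relevance` is an assumption: it is consistent with the rows above (a weight of 0.0 zeroes the row even at 8.0/10 relevance), but the report does not state how row scores are actually computed, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight


def row_score(weight, relevance):
    """ASSUMPTION: a row's score is weight * relevance (not confirmed by the report)."""
    return weight * relevance


# The two non-zero-relevance rows of the MegaFlow table: (weight, relevance).
rows = [(0.0, 8.0), (0.0, 5.0)]

total = sum(row_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)  # 27 keywords, each with weight 1.0
passed = total >= threshold

print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumed rule, a paper is rejected whenever its table weights are 0.0, regardless of the relevance values.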

!!! tip deepseek-chat TL;DR

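The paper proposes MegaFlow, a model that uses pre-trained Vision Transformer features for global matching to address zero-shot generalization in large displacement optical flow estimation, and achieves state-of-the-art zero-shot performance on multiple benchmarks.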

Abstract

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search or/and domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.

Keywords: optical flow, zero-shot, large displacement, Vision Transformer, pre-trained features, global matching, motion estimation, transferability
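The MegaFlow abstract frames flow estimation as global matching over pre-trained features. As a rough illustration of that idea, the sketch below compares every pixel's feature in frame 1 against all features in frame 2, turns the similarities into a softmax distribution, and reads the flow off the expected matching position. This is an assumption about one common way to realize global matching, not MegaFlow's implementation; the function names, the toy one-hot features, and the temperature are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_match_flow(feat1, feat2, coords, temperature=0.01):
    """Flow per pixel: expected matching position in frame 2 minus own position.

    feat1, feat2: one feature vector per pixel; coords: the (x, y) of each pixel.
    """
    flows = []
    for f1, (x1, y1) in zip(feat1, coords):
        # Similarity to EVERY pixel of frame 2, so the search range is unbounded.
        sims = [sum(a * b for a, b in zip(f1, f2)) / temperature for f2 in feat2]
        weights = softmax(sims)
        x2 = sum(w * x for w, (x, _) in zip(weights, coords))
        y2 = sum(w * y for w, (_, y) in zip(weights, coords))
        flows.append((x2 - x1, y2 - y1))
    return flows


# Toy 1x4 image with one-hot features; frame 2 moves each feature by two pixels.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
feat1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
feat2 = [feat1[2], feat1[3], feat1[0], feat1[1]]
flows = global_match_flow(feat1, feat2, coords)
```

Because each pixel is matched against the whole frame rather than a local window, a displacement of half the image width is recovered in a single step in this toy case.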

167. ❌ How good was my shot? Quantifying Player Skill Level in Table Tennis

Authors: Akihiro Kubota, Tomoya Hasegawa, Ryo Kawahara, Ko Nishino Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

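Scoring rationale: The paper quantifies the skill level of table tennis players using generative models and latent space embeddings, placing it in computer vision, behavior analysis, and sports analytics. All of the keywords concern large language models, the underlying techniques of deep learning models, or AI applications in science, none of which this paper addresses, so every keyword receives a relevance of 0.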

!!! tip deepseek-chat TL;DR

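The paper trains generative models that embed table tennis players' tactical strokes in a common latent space, enabling automated quantification of skill level in complex, interactive behaviors.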

Abstract

Gauging an individual’s skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports – specifically table tennis – where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player’s tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context – including player positioning and opponent behaviors – the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.

Keywords: skill quantification, table tennis, generative model, latent space embedding, player behavior analysis, tactical racket strokes, automated skill assessment, 3D-reconstructed matches
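The abstract mentions training "a simple relative ranking network" on the player embeddings. A minimal sketch of what such a ranker could look like is a linear scorer trained with a pairwise logistic loss, shown below. The linear form, the loss, the hyperparameters, and the toy 2-D embeddings are all assumptions for illustration; the abstract does not describe the actual network.

```python
import math


def score(w, z):
    """Scalar skill score of embedding z under linear weights w."""
    return sum(wi * zi for wi, zi in zip(w, z))


def train_pairwise_ranker(pairs, dim, lr=0.5, epochs=200):
    """Fit w so that score(better) > score(worse) for each (better, worse) pair.

    Minimizes -log(sigmoid(score(better) - score(worse))) by per-pair gradient steps.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for z_better, z_worse in pairs:
            diff = score(w, z_better) - score(w, z_worse)
            # Magnitude of the loss gradient: 1 - sigmoid(diff).
            g = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            w = [wi + lr * g * (b - c) for wi, b, c in zip(w, z_better, z_worse)]
    return w


# Hypothetical player embeddings with known relative order A > B > C.
player_a, player_b, player_c = [1.0, 0.2], [0.5, 0.9], [0.1, 0.4]
pairs = [(player_a, player_b), (player_b, player_c)]
w = train_pairwise_ranker(pairs, dim=2)
ranking = sorted(
    ["A", "B", "C"],
    key=lambda n: -score(w, {"A": player_a, "B": player_b, "C": player_c}[n]),
)
```

Training only on relative comparisons still yields a scalar score per player, which is one way relative labels could support an absolute ordering.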

168. ❌ Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Authors: Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25734v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

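Scoring rationale: The paper studies human-object interaction animation generation using diffusion models and a data-driven approach; its core elements are modality-specific representations, asynchronous denoising, and contact-aware guidance. The keywords all concern large language models, deep learning techniques for such models, or AI for science, whereas this paper is a diffusion model application in computer vision and graphics and touches none of the keyword areas, so every keyword receives a relevance of 0.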

!!! tip deepseek-chat TL;DR

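The paper proposes LIGHT, a data-driven method that uses modality-specific representations and asynchronous denoising schedules to obtain contact-aware guidance without auxiliary classifiers for human-object interaction animation, improving contact fidelity and generalization.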

Abstract

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.

Keywords: human-object interaction animation, diffusion models, data-driven guidance, contact-aware, asynchronous denoising, cross-attention, generalization, synthetic object geometries
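The LIGHT abstract describes assigning individualized noise levels to modality-specific components through asynchronous denoising schedules, so that cleaner components can guide noisier ones. The sketch below shows one simple way two components could follow offset schedules. The linear schedule, the fixed lead of a few steps, and the "leading"/"trailing" naming are assumptions; the abstract does not say which component is denoised first or what schedule shape is used.

```python
def async_schedule(num_steps, lead):
    """Per-step noise levels (t_leading, t_trailing), each in [0, 1].

    Both components go linearly from 1.0 (pure noise) to 0.0 (clean), but the
    leading component starts `lead` steps earlier, so it is never noisier
    than the trailing one.
    """
    schedule = []
    for k in range(num_steps + lead + 1):
        t_leading = max(0.0, 1.0 - k / num_steps)
        t_trailing = min(1.0, max(0.0, 1.0 - (k - lead) / num_steps))
        schedule.append((t_leading, t_trailing))
    return schedule


schedule = async_schedule(num_steps=4, lead=2)
for step, (t_lead, t_trail) in enumerate(schedule):
    print(f"step {step}: leading={t_lead:.2f} trailing={t_trail:.2f}")
```

At every intermediate step the leading component carries less noise, which is the condition under which it could serve as a guidance signal for the other component through cross-attention.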

169. ❌ BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Authors: Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25732v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

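Scoring rationale: "BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation" is an evaluation benchmark for image generation models, specifically for commercial visual content generation (slides, charts, webpages, posters, and scientific figures). The given keywords all concern large language models (LLMs) or the underlying deep learning techniques, whereas this paper studies image generation models (such as diffusion models and GANs), which belong to computer vision and have no direct connection to text-based large model techniques. Every keyword therefore receives a relevance of 0.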

!!! tip deepseek-chat TL;DR

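The paper addresses the lack of systematic evaluation of image generation models for commercial visual content creation. It proposes BizGenEval, a benchmark covering five document types and four capability dimensions, and its large-scale evaluation reveals substantial gaps between current models and professional requirements.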

Abstract

Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.

Keywords: commercial visual content generation, image generation models, benchmark, BizGenEval, layout control, text rendering, attribute binding, knowledge-based reasoning
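The BizGenEval abstract gives the benchmark's size as counts: five document types crossed with four capability dimensions, 400 prompts, and 8000 checklist questions. The short sketch below enumerates the task grid and derives the per-task and per-prompt averages. The even split of prompts across tasks and the `checklist_pass_rate` metric are assumptions for illustration; the abstract states only the totals.

```python
from itertools import product

DOCUMENT_TYPES = ["slides", "charts", "webpages", "posters", "scientific figures"]
CAPABILITIES = [
    "text rendering",
    "layout control",
    "attribute binding",
    "knowledge-based reasoning",
]

# Each (document type, capability) pair is one evaluation task.
tasks = list(product(DOCUMENT_TYPES, CAPABILITIES))

# ASSUMPTION: prompts are spread evenly over tasks; only the totals are given.
prompts_per_task = 400 / len(tasks)
questions_per_prompt = 8000 / 400  # average number of checklist questions


def checklist_pass_rate(answers):
    """Hypothetical metric: fraction of checklist questions answered 'yes'."""
    return sum(answers) / len(answers)


print(len(tasks), prompts_per_task, questions_per_prompt)
```

For example, `checklist_pass_rate([True, True, False, True])` gives 0.75 for an image that satisfies three of four constraints.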

170. ❌ SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Authors: Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

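Scoring rationale: The paper proposes SlotVTG, a lightweight slot adapter for multimodal large language models (MLLMs) that improves generalization in video temporal grounding (VTG). Relevant keywords: 1) "Large Language Models OR LLMs OR Foundation Models" (8 points): the paper builds on MLLMs and is an application of large models; 2) "Post-training OR Supervised Fine-tuning OR SFT" (8 points): the paper concerns fine-tuning for the VTG task and the generalization problem it causes; 3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points): SlotVTG is a lightweight adapter, a parameter-efficient fine-tuning technique, and the core contribution. "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points) is touched on through domain adaptation but is not the focus; the remaining keywords are unrelated to the paper (0 points).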

!!! tip deepseek-chat TL;DR

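The paper addresses the poor generalization that fine-tuning induces in multimodal large language models on video temporal grounding. It proposes SlotVTG, a lightweight object-centric adapter that significantly improves cross-domain robustness while maintaining in-domain performance.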

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

Keywords: Multimodal Large Language Models, Video Temporal Grounding, Object-centric Learning, Slot Adapter, Parameter-efficient Fine-tuning, Out-of-Domain Generalization, Slot Attention, Visual Reasoning

171. ❌ AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Authors: Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25726v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

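Scoring rationale: The paper focuses on hand pose estimation in computer vision and improves model performance by building a large-scale synthetic dataset (AnyHand); it is AI research applied to a specific area (computer vision). It is unrelated to the vast majority of keywords (large model techniques, training methods, inference optimization, alignment, and so on). It has some relevance to "Scaling Laws AND Data Quality", because the paper examines how data scale and quality affect performance, and to "AI for Science OR Bioinformatics OR Cheminformatics", as an application of AI in a scientific field, though not bioinformatics or cheminformatics.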

!!! tip deepseek-chat TL;DR

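The paper presents AnyHand, a large-scale synthetic dataset for improving 3D hand pose estimation from RGB and RGB-D inputs; experiments show that it significantly improves the accuracy and generalization of existing models.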

Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

Keywords: hand pose estimation, synthetic dataset, RGB-D, 3D hand pose, data augmentation, generalization, depth fusion, computer vision

172. ❌ No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

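Scoring rationale: The paper studies compositional representation learning in contrastive vision-language (V&L) models and proposes concept-centric learning and cross-modal attention pooling to improve them without generating hard negatives. All scoring keywords relate to large language models (LLMs), whereas the paper focuses on vision-language models (such as CLIP) and does not touch LLM techniques, training methods, inference optimization, alignment, agent systems, AI for science, or any other keyword area. Every keyword therefore receives a relevance of 0.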

!!! tip deepseek-chat TL;DR

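The paper addresses the limited compositional representation learning of contrastive vision-language models. Using concept-centric learning and cross-modal attention pooling, it achieves SOTA performance on compositionality benchmarks without increasing inference cost, while maintaining or improving zero-shot and retrieval capabilities.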

Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.

Keywords: contrastive learning, vision-language models, compositionality, concept centric learning, cross-modal attention, zero-shot capability, retrieval performance, hard negatives
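The abstract's second remedy is a parameter-free cross-modal attention pooling that replaces global pooling when extracting concept-centric visual embeddings. One plausible reading, sketched below, uses the text embedding of a concept as the query over image patch tokens, with no learned projections. The scaled dot-product form, the function name, and the toy 2-D tokens are assumptions; the authors' released code may differ.

```python
import math


def attention_pool(query, tokens):
    """Pool patch tokens with softmax weights given by similarity to the query.

    No learned parameters: the tokens serve as both keys and values.
    Returns (pooled_vector, weights).
    """
    d = len(query)
    logits = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights


# Toy patch tokens and a concept query aligned with the first patch.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
concept_query = [1.0, 0.0]
pooled, weights = attention_pool(concept_query, tokens)

# Plain global mean pooling, for comparison.
mean_pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(2)]
```

Unlike mean pooling, the pooled vector shifts toward the patches that match the concept, which is the kind of per-concept information a single global pooled embedding discards.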

173. ❌ TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Authors: Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

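Scoring rationale: The paper studies object motion path editing in videos and proposes the TRACE framework, which performs motion editing guided by a first-frame trajectory through a two-stage pipeline. The work belongs to computer vision and video processing, focusing on object trajectory editing, video synthesis, and motion control; it does not involve large language models, innovations in deep learning fundamentals, or applications of AI in science. All keywords concern large models, deep learning techniques, or AI for science and are unrelated to the paper's video-editing topic, so every keyword is scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses object motion path editing in videos and proposes the TRACE framework, which uses first-frame trajectory guidance and a two-stage pipeline to produce more coherent, realistic, and controllable motion edits.

Abstract (Chinese translation)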

我们研究视频中的物体运动路径编辑，其目标是在保持原始场景内容的同时改变目标物体的运动轨迹。与先前主要操纵外观或依赖基于点跟踪的轨迹控制的视频编辑方法不同——这类方法在推理过程中往往需要用户提供轨迹点，尤其在存在相机运动的视频中操作难度较大——我们提出了一种实用、易用的可控物体中心运动编辑方法。我们提出了Trace框架，该框架允许用户在单个锚点帧中设计期望轨迹，随后合成时间一致的编辑视频。我们的方法通过两阶段流程解决此任务：跨视角运动变换模块将首帧路径设计映射为相机运动下的帧对齐边界框轨迹，以及运动条件视频重合成模块遵循这些轨迹重新生成物体，同时保留输入视频的其余内容。在多样化真实世界视频上的实验表明，与近期图像到视频及视频到视频方法相比，我们的方法能产生更连贯、真实且可控的运动编辑效果。

Abstract

We study object motion path editing in videos, where the goal is to alter a target object’s trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

Keywords: object motion editing, video editing, trajectory guidance, cross-view motion transformation, motion-conditioned video re-synthesis, temporal consistency, camera motion, controllable editing

174. ❌ Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

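Scoring rationale: The paper's core topic is hallucination in Multimodal Diffusion Large Language Models (MDLLMs), which falls within large-model technology, so it is highly relevant to "Large Language Models" (10). It directly proposes a solution to hallucination, making it highly relevant to "Hallucination Mitigation" (10). It explains model behavior by analyzing attention distributions, which gives it some relevance to "Mechanistic Interpretability" (5). It does not cover the other keywords, such as MoE, SLMs, training techniques, inference acceleration, or agents, so those are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets visual hallucinations in Multimodal Diffusion Large Language Models (MDLLMs) caused by the dominance of textual probability. It proposes VISAGE, a training-free inference-time decoding framework that calibrates the objective by quantifying the spatial entropy of cross-attention distributions, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

Abstract (Chinese translation)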

多模态扩散大语言模型（MDLLMs）通过并行掩码解码实现高并发生成，但其架构仍易产生多模态幻觉。这种结构性缺陷源于算法层面的漏洞：解码器仅依据文本似然性对候选词元进行排序，而未验证局部视觉支持。我们证实，这种纯语言排序机制导致了目标错位——语言概率质量充当了多模态任务目标的错误代理指标。因此，我们将幻觉重新阐释为局部优化误差：解码器为最大化代理分数而利用语言捷径，牺牲了视觉 grounding。为解决目标错位问题，我们提出VISAGE——一种无需训练的推理时解码框架，通过校准目标函数实现优化。VISAGE通过量化交叉注意力分布的空间熵来估计代理指标偏差，通过强制注意力头达成定位共识，该方法惩罚空间均匀分布并重新排序词元选择，以优先产生视觉 grounded 的结果。我们提供了分析稳定性证明，确立VISAGE在估计误差下能保持有界目标损失。在幻觉敏感基准和通用基准上的评估验证了该框架的鲁棒性，在MMMU-val和HallusionBench上分别实现8.59%和7.75%的相对性能提升。

Abstract

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

Keywords: Multimodal Diffusion Large Language Models, Hallucination Mitigation, Visual Grounding, Cross-attention, Decoding Framework, Objective Mismatch, Spatial Entropy, Training-free
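
As a rough illustration of the idea in the abstract above (penalizing spatially uniform cross-attention and re-ranking candidate tokens), the following Python sketch computes a normalized spatial entropy per attention head and subtracts it from the text log-likelihood. It is not the authors' implementation: the function names, the mean over heads used as a stand-in for the "localization consensus", and the penalty weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def spatial_entropy(attn):
    """Normalized Shannon entropy of attention over image patches.

    attn: array of shape (heads, patches) with non-negative weights.
    Returns one value per head in [0, 1]; 1 means spatially uniform.
    """
    attn = np.asarray(attn, dtype=float)
    p = attn / attn.sum(axis=-1, keepdims=True)
    # Treat 0 * log(0) as 0 by only taking the log of positive entries.
    logp = np.log(p, where=p > 0, out=np.zeros_like(p))
    h = -(p * logp).sum(axis=-1)
    return h / np.log(p.shape[-1])

def rerank(log_likelihoods, attn_per_candidate, lam=1.0):
    """Hypothetical re-ranking: text score minus an entropy penalty."""
    # Averaging over heads is an assumed, simplified aggregation.
    penalties = np.array([spatial_entropy(a).mean() for a in attn_per_candidate])
    scores = np.asarray(log_likelihoods, dtype=float) - lam * penalties
    return np.argsort(-scores), scores
```

With this scoring, a candidate whose attention is concentrated on a few patches can overtake a candidate with a slightly higher text likelihood but diffuse attention, which is the qualitative behavior the abstract describes.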

175. ❌ Wan-Weaver

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25706v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

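Scoring rationale: The paper proposes the Wan-Weaver framework for interleaved multimodal generation. Its core involves applying large models (LLMs) to textual planning and visual consistency modeling, realized through decomposed training (pre-training/SFT) and data construction (scaling/data quality). It has some relevance to long-context modeling and task reasoning (CoT), but does not involve specific techniques such as MoE, small models, alignment, RAG, compression, or agents.

!!! tip deepseek-chat TL;DR

The paper proposes the Wan-Weaver framework, which decomposes interleaved multimodal generation into textual planning and visual consistency modeling, trains on large-scale textual-proxy data and reference-guided image data, and exhibits emergent interleaved generation with long-range textual coherence and visual consistency.

Abstract (Chinese translation)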

近期统一模型在理解与生成任务上取得了前所未有的进展。然而，尽管多数模型能够接受多模态输入，它们通常仅能产生单模态输出。这种交错内容生成的挑战主要源于训练数据稀缺以及长程跨模态上下文建模的困难。为解决这一问题，我们将交错生成分解为文本规划与视觉一致性建模，并引入一个由规划器和可视化器构成的框架。规划器为视觉内容生成密集的文本描述，而可视化器则据此合成图像。在此指导下，我们构建了大规模文本代理交错数据（其中视觉内容以文本形式表示）以训练规划器，并整理参考引导的图像数据以训练可视化器。这些设计催生了Wan-Weaver，该模型展现出具有长程文本连贯性与视觉一致性的涌现交错生成能力。同时，通过将多样化的理解与生成数据整合到规划器训练中，Wan-Weaver能够实现稳健的任务推理与生成能力。为评估模型在交错生成中的性能，我们进一步构建了一个涵盖多维度广泛使用场景的基准测试。大量实验表明，即使在未接触任何真实交错数据的情况下，Wan-Weaver仍优于现有方法，取得了卓越的性能表现。

Abstract

Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model’s capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

Keywords: interleaved multi-modal generation, decoupled training, textual planning, visual consistency modeling, large-scale textual-proxy data, emergent generation ability, task reasoning, benchmark evaluation

176. ❌ Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Authors: Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

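Scoring rationale: The paper focuses on the stability of robot world models, using a reinforcement learning (RL) post-training scheme to address error accumulation in multi-step prediction. This is highly relevant to "World Models AND General World Models" (10), since improving robot world models is the paper's core. "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 because robotics can be viewed as an application of AI in science and engineering, although the paper does not explicitly involve bioinformatics or cheminformatics. The other keywords (such as LLMs, MoE, SFT, and RAG) are neither mentioned in nor related to the paper, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the visual quality degradation caused by error accumulation when robot world models are rolled out autoregressively over multiple steps, and introduces an RL-based post-training scheme that achieves state-of-the-art rollout fidelity on the DROID dataset.

Abstract (Chinese translation)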

动作条件机器人世界模型能够根据给定的机器人动作序列生成被操控场景的未来视频帧，为模拟传统物理引擎难以建模的任务提供了有前景的替代方案。然而，这些模型通常针对短期预测进行优化，在自回归部署时会出现失效：每个预测的视频片段会作为下一个片段的上下文反馈，导致误差累积和视觉质量迅速下降。我们通过以下贡献解决这一问题。首先，我们引入一种强化学习（RL）后训练方案，该方案在世界模型自身的自回归推演上进行训练，而非基于真实历史数据。我们通过将近期一项针对扩散模型的对比性强化学习目标适配到我们的场景中来实现这一点，并证明其收敛性保证完全适用。其次，我们设计了一种训练协议，能够从相同的推演状态生成并比较多个候选的、可变长度的未来序列，从而强化高保真度预测而非低保真度预测。第三，我们开发了高效的多视角视觉保真度奖励函数，该函数结合了不同相机视角间互补的感知度量指标，并在片段级别进行聚合，以提供密集、低方差的训练信号。第四，我们在DROID数据集上证明了我们的方法为推演保真度确立了新的最优性能，在所有指标上均超越了最强基线（例如，外部摄像头的LPIPS降低了14%，腕部摄像头的SSIM提升了9.1%），在配对比较中赢得了98%的胜率，并在盲测人类研究中获得了80%的偏好率。

Abstract

Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

Keywords: robot world models, reinforcement learning, autoregressive rollouts, multi-step prediction, visual fidelity, diffusion models, DROID dataset, post-training

177. ❌ LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation

Authors: Ishaan Gakhar, Laven Srivastava, Sankarshanaa Sagaram, Aditya Kasliwal, Ujjwal Verma Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25689v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

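Scoring rationale: The paper focuses on semantic segmentation of marine remote sensing imagery and proposes LEMMA, a lightweight CNN architecture that uses Laplacian pyramids to enhance edge recognition. All keywords concern large language models (LLMs), innovations in deep learning fundamentals, or applications of large models in various domains, whereas the paper does not involve LLMs, MoE, scaling laws, pre-training, alignment, inference acceleration, agents, or other large-model techniques, nor does it mention specific AI-for-science applications such as bioinformatics or cheminformatics. It has only a weak connection to "AI for Science" (5), because it applies AI to marine environmental monitoring (a peripheral scientific application) rather than to core bioinformatics or cheminformatics. Therefore all keywords are scored 0 except the last, which is scored 5.

!!! tip deepseek-chat TL;DR

To address the high computational cost and resource demands of existing semantic segmentation models in marine environments, the paper proposes LEMMA, a lightweight model that integrates Laplacian pyramids to enhance edge recognition, achieving state-of-the-art performance on multiple datasets while substantially reducing parameters, computation, and inference time.

Abstract (Chinese translation)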

海洋环境中的语义分割对于无人水面艇（USV）的自主导航以及如溢油事件等海岸地球观测任务至关重要。然而，现有方法通常依赖于深度卷积神经网络（CNN）和基于Transformer的架构，因其高计算成本和资源密集特性，在部署时面临挑战。这些限制阻碍了在实际海洋环境中实现实时、低成本应用的可行性。
为此，我们提出LEMMA，一种专为资源受限条件下精确遥感分割设计的轻量级语义分割模型。该架构利用拉普拉斯金字塔（Laplacian Pyramids）来增强边缘识别能力，这是针对灾害响应、环境监测和海岸监控等复杂海洋环境中进行有效特征提取的关键组成部分。通过在特征提取过程的早期阶段整合边缘信息，LEMMA避免了在深层网络中进行计算代价高昂的特征图计算，从而显著减少了模型大小、复杂度和推理时间。与现有模型相比，LEMMA在不同平台采集的数据集上均展现出先进的性能，同时将可训练参数和计算需求降低了高达71倍，GFLOPs减少高达88.5%，推理时间缩短高达84.65%。实验结果突显了其有效性和实际适用性，包括在溢油数据集上达到93.42%的交并比（IoU），在Mastr1325数据集上达到98.97%的平均交并比（mIoU）。

Abstract

Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.

Keywords: semantic segmentation, marine environments, lightweight model, Laplacian Pyramids, edge recognition, remote sensing, computational efficiency, real-time applications
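
To make the Laplacian pyramid mentioned in the abstract concrete, here is a minimal NumPy sketch of the general technique. It is a simplified variant and not LEMMA's architecture: it uses 2x2 average pooling in place of the usual Gaussian blur, nearest-neighbor upsampling, and hypothetical function names, and it assumes a single-channel image whose height and width are divisible by `2 ** levels`.

```python
import numpy as np

def downsample(img):
    """Halve resolution with 2x2 average pooling (stand-in for blur + subsample)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbor repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return `levels` band-pass residuals plus the final low-pass image."""
    pyramid = []
    current = np.asarray(img, dtype=float)
    for _ in range(levels):
        low = downsample(current)
        # The residual holds the detail (edges) lost by downsampling.
        pyramid.append(current - upsample(low))
        current = low
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the pyramid: upsample the low-pass image and add residuals back."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current
```

The residual levels are near zero in flat regions and large at intensity transitions, which is why such pyramids are a cheap source of edge information for early feature extraction.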

178. ❌ Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving

Authors: Yuqian Shao, Xiaosong Jia, Langechuan Liu, Junchi Yan Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

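Scoring rationale: The paper focuses on end-to-end autonomous driving (E2E-AD), studying how to condition driving policies on a user-specified desired speed and overtaking instructions. It covers benchmark construction, dataset creation, experiments on supervision strategies, and performance evaluation. All scoring keywords relate directly to large models, deep learning fundamentals, or applications of AI in science, whereas this paper addresses a specific autonomous-driving engineering problem and does not touch on large-model techniques, deep learning innovations, or AI for Science (such as bioinformatics or cheminformatics). Therefore the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper presents Bench2Drive-Speed, a benchmark for desired-speed conditioned autonomous driving. Through dataset construction and experiments, it finds that, with proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, while executing overtaking commands remains challenging.

Abstract (Chinese translation)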

端到端自动驾驶（E2E-AD）已取得显著进展。然而，一个实用且重要的功能长期被忽视：用户可能希望定制策略的期望速度，或指定是否允许自动驾驶车辆进行超车。为弥补这一差距，我们提出了Bench2Drive-Speed——一个包含评估指标、数据集与基准模型的、面向期望速度条件化自动驾驶的基准平台。我们向驾驶策略模型引入了用户期望目标速度与超车/跟随指令的显式输入。我们设计了量化指标，包括速度遵循评分（Speed-Adherence Score）和超车评分（Overtake Score），以衡量策略遵循用户指令的忠实程度，同时保持与标准自动驾驶指标的兼容性。为训练速度条件化策略，一种方法是收集严格遵循速度要求的专家示范数据，但这在现实世界中成本高昂且难以扩展。另一种方案是利用现有常规驾驶数据，将未来帧中观测到的速度视为训练目标速度。为此，我们构建了CustomizedSpeedDataset数据集，包含2,100段标注有专家示范的片段，以支持对监督策略的系统性研究。实验表明，在适当的重新标注下，基于常规驾驶数据训练的模型表现与基于专家示范数据的模型相当，这说明无需额外复杂的现实世界数据收集即可引入速度监督。此外，我们发现虽然遵循目标速度不会降低常规驾驶性能，但由于交互行为固有的复杂性，执行超车指令仍具挑战性。所有代码、数据集与基准模型均公开于https://github.com/Thinklab-SJTU/Bench2Drive-Speed。

Abstract

End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has been long overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users’ desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at https://github.com/Thinklab-SJTU/Bench2Drive-Speed

Keywords: autonomous driving, end-to-end autonomous driving, speed-conditioned policy, benchmark, user specification, overtaking, dataset, supervision strategies
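
The re-annotation idea in the abstract above (treating the speed observed in future frames as the training target) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the fixed look-ahead `horizon`, and the choice to clamp at the end of the clip are hypothetical and are not taken from the paper or its repository.

```python
def relabel_target_speed(speeds, horizon):
    """Use the speed observed `horizon` frames later as each frame's target speed.

    Frames near the end of the clip reuse the last observed speed
    (an assumption; a real pipeline might drop those frames instead).
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    if not speeds:
        return []
    last = len(speeds) - 1
    return [speeds[min(t + horizon, last)] for t in range(len(speeds))]
```

For example, with per-frame speeds `[0.0, 2.0, 4.0, 6.0]` and `horizon=2`, the first frame is labeled with the speed the vehicle actually reached two frames later, so ordinary driving logs yield (observation, target speed) pairs without new expert demonstrations.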

179. ❌ Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Authors: Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25661v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

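Scoring rationale: The paper mainly studies fine-tuning methods for vision-language-action (VLA) models; its core contribution is a new training strategy that improves standard supervised fine-tuning (SFT). It is therefore highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10), since it explicitly focuses on improving the SFT process. It has some relevance to "Model Merging OR Model Soups OR Weight Averaging" (5), because the method merges model parameters (capability vectors are merged with pretrained parameters). The other keywords (such as LLMs, MoE, and RLHF) are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new training strategy that decouples the objectives of auxiliary-task training and merges capability vectors into the pretrained model, allowing standard supervised fine-tuning (SFT) to match auxiliary fine-tuned baselines at lower computational cost; its effectiveness is validated on a variety of robot tasks.

Abstract (Chinese translation)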

本文提出一种新方法，以解决预训练视觉语言动作模型在标准监督微调过程中常无法有效提升性能并降低适应成本的问题。现有一些采用辅助训练目标的高级微调方法虽能提升性能并减少收敛步数，但其通常因辅助任务产生的额外损失而带来显著计算开销。为在保持标准监督微调简洁性的同时实现辅助训练的能力增强效果，我们将辅助任务训练在参数空间内的两个目标——即增强通用能力与拟合任务特定动作分布——进行解耦。为实现该目标，我们仅需使用两种不同训练策略让模型在小型任务集上收敛，所得模型参数间的差异即可解释为辅助任务提供的能力向量。这些向量随后与预训练参数融合，形成能力增强的元模型。此外，当标准监督微调辅以轻量级正交正则化损失时，融合模型能够以更低计算开销达到与辅助微调基线相当的性能。实验结果表明，该方法在多样化机器人任务中均具有显著有效性。项目页面：https://chris1220313648.github.io/Fast-dVLA/

Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/

Keywords: VLA models, supervised fine-tuning, SFT, auxiliary training, parameter merging, robot tasks, computational overhead, capability enhancement
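
The parameter-space arithmetic described in the abstract above (a difference between two trained models interpreted as a capability vector, then merged into the pretrained parameters) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the dictionary-of-arrays parameter format, the direction of the subtraction (auxiliary-trained minus plain-SFT), and the scaling factor `alpha` are assumptions, and the orthogonal regularization loss is not shown.

```python
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Per-parameter difference between two models trained on the same small task set."""
    return {name: theta_aux[name] - theta_sft[name] for name in theta_aux}

def merge(theta_pre, vector, alpha=1.0):
    """Add the (scaled) capability vector to the pretrained parameters."""
    return {name: theta_pre[name] + alpha * vector[name] for name in theta_pre}
```

Under these assumptions, the merged parameters would serve as the starting point for ordinary SFT on downstream tasks, so the auxiliary losses are paid only once on the small task set.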

180. ❌ Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Authors: Abdullah Hamdi, Changchun Yang, Xin Gao Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

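Scoring rationale: The paper's core contribution is an agentic workflow for dense annotation of colonoscopy videos, together with the Colon-Bench dataset, which is directly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10). It evaluates multimodal large language models (MLLMs) in the medical domain, which relates to 'Large Language Models OR LLMs OR Foundation Models' (8). The work is a bioinformatics/medical AI application, highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). Building the dataset involves data-quality considerations, giving some relevance to 'Scaling Laws AND Data Quality' (5). Other keywords such as MoE, SFT, and RAG are not covered in the paper and are rated 0.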
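The per-keyword scores in the table above can be reproduced with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) and the cap of 15 per keyword come from the report's configuration section; the rule that a keyword's score is weight × relevance is an assumption inferred from the table columns, and the function names are hypothetical.

```python
from typing import List, Tuple


def keyword_score(weight: float, relevance: float, cap: float = 15.0) -> float:
    """Assumed rule: weight times relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, cap)


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) pairs for the rows with nonzero relevance in the table above.
rows: List[Tuple[float, float]] = [(0.0, 8.0), (0.0, 5.0), (0.0, 10.0), (0.0, 10.0)]

total = sum(keyword_score(w, r) for w, r in rows)
threshold = passing_score(27.0)   # 27 keywords in the configuration, weight 1.0 each
passed = total >= threshold
```

Under this assumed rule, the 0.0 weights shown in the table make every product 0.0 regardless of relevance, which matches the reported total of 0.0 against the 26.6 threshold.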

!!! tip deepseek-chat TL;DR

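To address the lack of densely annotated datasets for colonoscopy video, the paper develops an agentic workflow to build the Colon-Bench dataset, uses it to evaluate multimodal large language models on medical tasks, and proposes a new prompting strategy that improves model performance.

Abstract (Chinese translation)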

通过结肠镜进行早期筛查对预防结肠癌至关重要，但该领域稳健人工智能系统的开发因缺乏密集标注的长序列视频数据集而受阻。现有数据集主要聚焦于单类别息肉检测，缺乏评估现代多模态大语言模型（MLLMs）所需的丰富空间、时间及语言标注。为填补这一关键空白，我们通过一种新颖的多阶段智能体工作流程构建并推出了Colon-Bench。我们的流程无缝整合了时序提议、边界框跟踪、人工智能驱动的视觉确认以及人机协同审核，从而可扩展地对完整手术视频进行标注。最终生成的已验证基准在规模上前所未有，包含528段视频、14种不同的病灶类别（包括息肉、溃疡和出血）、超过30万个边界框、21.3万个分割掩码以及13.3万字的临床描述。我们利用Colon-Bench对前沿的MLLMs在病灶分类、开放词汇视频目标分割（OV-VOS）和视频视觉问答（VQA）任务上进行了严格评估。结果显示，与SAM-3相比，MLLMs在医学领域展现出惊人的高定位性能。最后，我们通过分析MLLMs在VQA中的常见错误，提出了一种新颖的“结肠技能”提示策略，该策略将大多数MLLM的零样本性能提升了高达9.7%。数据集与代码公开于https://abdullahamdi.com/colon-bench。

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel “colon-skill” prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .

Keywords: Colon-Bench, agentic workflow, Multimodal Large Language Models, colonoscopy videos, dense lesion annotation, medical AI, video VQA, colon-skill prompting

181. ❌ UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation

Authors: Chengfeng Zhao, Junbo Qi, Yulou Liu, Zhiyang Dou, Minchen Li, Taku Komura, Ziwei Liu, Wenping Wang, Yuan Liu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25580v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

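Scoring rationale: UNIC addresses real-time garment animation in computer graphics, using a neural deformation field to avoid the computational cost of physics simulation. All scoring keywords concern large models, the principles of deep learning techniques, or AI applications in science, whereas this paper is a domain-specific neural network application (garment animation) that does not touch on large-model architecture, training methods, inference optimization, alignment, agent systems, or AI-for-science applications. All keywords are therefore rated 0.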

!!! tip deepseek-chat TL;DR

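The paper proposes UNIC, a neural deformation field method that generates physically realistic garment animation in real time, addressing the high computational cost of traditional physics simulation and improving deformation quality through an instance-specific learning scheme.

Abstract (Chinese translation)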

模拟物理真实的服装变形是虚拟沉浸体验中的关键任务，通常通过物理模拟方法实现。然而，这些方法往往耗时、计算需求高且需要昂贵的硬件，不适用于实时应用。近期基于学习的方法尝试通过训练图神经网络来学习顶点上的服装变形以解决此问题，但这些方法难以捕捉具有复杂拓扑结构的服装网格的精细变形。本文提出一种新颖的基于神经变形场的方法，命名为UNIC，用于根据运动序列实时驱动虚拟形象的服装动画。我们的核心思想是学习实例特定的神经变形场来驱动服装网格动画。这种实例特定的学习方案不要求UNIC泛化到新服装，而仅需适应新的运动序列，这极大地降低了训练难度并提升了变形质量。此外，神经变形场将三维点映射到其变形偏移量，不仅避免了处理复杂服装的拓扑结构，还为变形学习注入了自然的平滑性约束。我们在多种服装网格上进行了大量实验，证明了UNIC相较于基线方法的有效性和高效性，使其在电子游戏等现实交互应用中具有潜在实用价值。

Abstract

Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of complex garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.

Keywords: neural deformation field, garment animation, real-time simulation, instance-specific learning, 3D point mapping, physics simulation, virtual immersive experience, graph neural networks
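The abstract's description of a deformation field, a function that maps 3D points to deformation offsets given a motion input, can be sketched as a tiny multilayer perceptron. This is only an illustration: the network size, the motion encoding as a flat vector, the untrained random weights, and every name below are assumptions, not UNIC's architecture.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class DeformationField:
    """Maps (3D point, motion code) to a 3D offset; weights are untrained."""

    def __init__(self, motion_dim: int, hidden: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = 3 + motion_dim
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(3)]
        self.b2 = [0.0] * 3

    def offset(self, point: Sequence[float], motion: Sequence[float]) -> Vector:
        hidden = [math.tanh(v) for v in linear(list(point) + list(motion), self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)

    def deform(self, points: List[Vector], motion: Sequence[float]) -> List[Vector]:
        """Apply the field to each point independently; no mesh connectivity is used."""
        return [[p + d for p, d in zip(pt, self.offset(pt, motion))] for pt in points]


field = DeformationField(motion_dim=4)
rest_points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]   # garment vertices in rest pose
deformed = field.deform(rest_points, motion=[0.1, 0.1, 0.1, 0.1])
```

Because the field is evaluated per point, the sketch never touches mesh topology, and a smooth network yields nearby offsets for nearby points, which is the smoothness property the abstract mentions.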

182. ❌ LanteRn: Latent Visual Structured Reasoning

Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

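Scoring rationale: LanteRn proposes a new visual reasoning framework that strengthens the visual reasoning of large multimodal models (LMMs) by introducing latent visual representations. It is highly relevant to the following keywords: (1) 'Post-training OR Supervised Fine-tuning OR SFT' (10): the paper explicitly uses supervised fine-tuning to align visual features with latent states; (2) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (10): the paper uses reinforcement learning to align latent reasoning with task-level utility; (3) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (8): the visual reasoning process falls under multi-step reasoning; (4) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8): the focus on in-depth visual reasoning relates to System 2 thinking; (5) 'Large Language Models OR LLMs OR Foundation Models' (8): the work builds on LMMs, which count as large models. Other keywords such as MoE, quantization, and RAG are unrelated to the paper and score 0.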

!!! tip deepseek-chat TL;DR

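LanteRn addresses the reliance of large multimodal models on verbalizing visual content in visual reasoning tasks. By introducing a latent visual representation framework, it enables visual reasoning in latent space and consistently improves visual grounding and fine-grained reasoning on several visual benchmarks.

Abstract (Chinese translation)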

尽管语言推理模型在众多任务中表现出色，视觉推理对当前的大型多模态模型而言仍具挑战性。因此，大多数多模态模型默认将感知内容转化为文本，这对需要细粒度空间与视觉理解的任务构成了显著限制。近期方法虽通过调用工具或生成中间图像向“图像化思考”迈进，但它们要么依赖外部模块，要么因直接在像素空间进行推理而产生不必要的计算开销。本文提出LanteRn框架，使多模态模型能够在紧凑的潜在视觉表征中穿插语言，从而直接在潜在空间进行视觉推理。LanteRn通过增强视觉-语言Transformer的能力，使其在推理过程中能够生成并关注连续的视觉思维嵌入。我们通过两阶段训练该模型：首先进行监督微调以将视觉特征锚定于潜在状态，随后通过强化学习将潜在推理与任务级效用对齐。我们在三个以感知为核心的基准测试（VisCoT、V*和Blink）上评估LanteRn，观察到其在视觉定位与细粒度推理方面均取得持续改进。这些结果表明，内部潜在表征为更高效的多模态推理提供了有前景的发展方向。

Abstract

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

Keywords: visual reasoning, large multimodal models, latent representations, supervised fine-tuning, reinforcement learning, visual grounding, multimodal reasoning, LanteRn

183. ❌ Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Authors: Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

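Scoring rationale: The paper focuses on multimodal learning for biodiversity classification and proposes two hierarchy-aware representation learning methods (CLiBD-HiR and CLiBD-HiR-Fuse) for classifying image and DNA data. Its core contribution is encoding the hierarchical structure of biological taxonomy to improve robustness and accuracy, which makes it an application of AI in science (bioinformatics). However, it does not involve large language models (LLMs) or innovations in the principles of deep learning techniques, nor does it discuss any of the specific techniques named in the scoring keywords (e.g., MoE, Scaling Laws, RLHF, RAG). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI directly to a bioinformatics problem, so it receives 10. All other keywords are unrelated to the paper and score 0.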

!!! tip deepseek-chat TL;DR

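The paper tackles hierarchy-aware representation learning over multimodal data (images and DNA) for biodiversity classification. By introducing hierarchical information regularization and a flexible fusion method, it improves classification accuracy by more than 14% on large-scale biodiversity benchmarks.

Abstract (Chinese translation)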

从大规模野外数据中实现准确的生物多样性识别是一个基础性问题，对生态学、保护生物学和环境监测具有直接影响。实践中，核心任务是分类学预测——即从标本图像、DNA条形码或两者结合的不完美输入中推断其目、科、属或种。现有的多模态方法通常将分类学视为扁平标签空间，因此未能编码生物分类的层级结构，而这种结构对于噪声和模态缺失情况下的鲁棒性至关重要。我们提出了两种用于层级感知多模态学习的端到端变体：CLiBD-HiR，该方法引入层级信息正则化（Hierarchical Information Regularization, HiR）以塑造跨分类层级的嵌入几何，从而产生结构化且抗噪声的表征；以及CLiBD-HiR-Fuse，它额外训练了一个轻量级融合预测器，支持仅图像、仅DNA或联合推理，并能抵御模态损坏。在大规模生物多样性基准测试中，相较于强大的多模态基线方法，我们的方法将分类学预测准确率提升了超过14%，在DNA数据部分缺失或损坏的条件下提升尤为显著。这些结果表明，显式编码生物层级结构并结合灵活融合，是构建实用生物多样性基础模型的关键。

Abstract

Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.

Keywords: multimodal learning, taxonomic inference, hierarchical representation, biodiversity identification, DNA barcodes, noise robustness, foundation models, biological classification

184. ❌ Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

Authors: Nikolo Rohrmoser, Ghazal Ghazaei, Michael Sommersperger, Nassir Navab Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

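Scoring rationale: The paper focuses on multimodal image fusion and real-time scene understanding in ophthalmic surgery, using YoloNAS and a CNN encoder for feature extraction and a cross-attention module for fusion, and belongs to computer vision and medical image analysis. It is entirely unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to large language models and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to medicine (ophthalmic surgery), an application of AI in science and medicine, but it is not core bioinformatics or cheminformatics, so it receives 5 (some relevance).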

!!! tip deepseek-chat TL;DR

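The study proposes a multimodal real-time network architecture that fuses operating microscope and optical coherence tomography images, significantly improving the accuracy of instrument tracking and tool-tissue distance estimation in ophthalmic surgery.

Abstract (Chinese translation)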

目的：多模态成像与手术室的整合为全面理解手术场景开辟了道路。在眼科手术中，目前已有两种互补的成像模态可用：手术显微镜（OPMI）成像和实时术中光学相干断层扫描（iOCT）。这项首次针对时序性OPMI与iOCT特征融合的研究，通过玻璃体视网膜手术中精确器械跟踪的实例，展示了多模态图像处理在多任务预测方面的潜力。方法：我们提出了一种多模态、时序性、具备实时处理能力的网络架构，以执行联合器械检测、关键点定位和器械-组织距离估计。我们的网络设计集成了一个交叉注意力融合模块，用于融合OPMI和iOCT图像特征，这些特征分别通过YoloNAS和CNN编码器高效提取。此外，一个基于区域的循环模块利用了时序连贯性。结果：我们的实验证明了可靠的器械定位和关键点检测（95.79% mAP50），并表明iOCT的融入显著改善了器械-组织距离估计，同时实现了每帧22.5毫秒的实时处理速度。特别是对于接近视网膜的距离（小于1毫米），距离估计精度从仅使用OPMI时的284微米提升至多模态融合时的33微米。结论：与单模态处理相比，多模态成像的特征融合能够提高多任务预测的准确性，并且通过定制化的网络设计可以实现实时处理性能。虽然我们的结果证明了多模态处理在图像引导玻璃体视网膜手术中的潜力，但它们也突显了关键挑战，这些挑战将推动未来研究朝着更可靠、一致和全面的手术场景理解方向发展。

Abstract

Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing rates of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 μm (OPMI only) to 33 μm (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.

Keywords: ophthalmic surgery, multimodal image fusion, real-time scene understanding, instrument tracking, tool-tissue distance estimation, cross-attention fusion, vitreoretinal surgery, intraoperative OCT
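The cross-attention fusion step named in the abstract above can be illustrated with a minimal scaled dot-product sketch. It assumes OPMI feature tokens act as queries and iOCT feature tokens as keys and values, omits learned projection matrices, and adds the attended features back as a residual; these choices and all names are assumptions for illustration, not the paper's module.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(opmi: Matrix, ioct: Matrix) -> Matrix:
    """Each OPMI token attends over all iOCT tokens, then adds the result to itself."""
    dim = len(opmi[0])
    scale = 1.0 / math.sqrt(dim)
    fused: Matrix = []
    for query in opmi:
        weights = softmax(
            [scale * sum(q * k for q, k in zip(query, key)) for key in ioct]
        )
        attended = [
            sum(w * value[j] for w, value in zip(weights, ioct)) for j in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


opmi_tokens = [[1.0, 0.0]]               # one microscope feature token (toy values)
ioct_tokens = [[1.0, 0.0], [1.0, 0.0]]   # two identical OCT feature tokens
fused_tokens = cross_attention_fuse(opmi_tokens, ioct_tokens)
```

With two identical iOCT tokens the attention weights are equal, so the attended vector equals that token and the residual sum doubles the first coordinate; real feature maps would produce non-uniform weights.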

185. ❌ PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos

Authors: Yihao Wang, Yang Miao, Wenshuai Zhao, Wenyan Yang, Zihan Wang, Joni Pajarinen, Luc Van Gool, Danda Pani Paudel, Juho Kannala, Xi Wang, Arno Solin Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

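Scoring rationale: PAWS focuses on extracting the motion and structure of articulated objects from egocentric videos and belongs to computer vision and robot perception. Its core method involves video analysis, 3D reconstruction, and understanding of hand-object interactions; it does not involve large language models, innovations in the principles of deep learning techniques, or a specific AI-for-science application. All scoring keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, which are entirely unrelated to this paper, so every keyword is rated 0.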

!!! tip deepseek-chat TL;DR

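The paper proposes PAWS, which extracts the motion and structure of articulated objects directly from large-scale in-the-wild egocentric videos, removing the dependence of existing methods on high-quality 3D data and manual annotation. It significantly outperforms baselines on public datasets and shows that the extracted articulations benefit downstream tasks such as robot manipulation.

Abstract (Chinese translation)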

关节感知旨在恢复铰接物体（如抽屉和橱柜）的运动与结构，是机器人学、仿真与动画领域中三维场景理解的基础。现有基于学习的方法严重依赖高质量三维数据与人工标注的监督训练，限制了方法的可扩展性与多样性。为突破这一局限，我们提出PAWS方法，能够直接从大规模真实世界第一人称视角视频中的手-物交互数据中提取物体关节结构。我们在包括HD-EPIC和Arti4D在内的公开数据集上评估了本方法，相较于基线模型取得了显著提升。我们进一步证明，所提取的关节信息能够有效赋能下游任务，包括微调三维关节预测模型以及实现机器人操作。项目网站详见：https://aaltoml.github.io/PAWS/。

Abstract

Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on the public data sets, including HD-EPIC and Arti4D data sets, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.

Keywords: articulation perception, egocentric videos, hand-object interactions, 3D scene understanding, robot manipulation, unsupervised learning, articulated objects, PAWS method

186. ❌ BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Authors: Ning Ding, Keisuke Fujii, Toru Tamaki Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

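Scoring rationale: The paper studies sports video analysis, specifically badminton, and develops a densely annotated full-match dataset (BFMD) and a VideoMAE-based multimodal captioning framework. It is focused entirely on computer vision, video understanding, and sports analytics, and does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or core AI-for-science content (such as bioinformatics). All scoring keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas the paper's subject, methods, experiments, and application scenarios are unrelated to them. All keywords are therefore rated 0.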

!!! tip deepseek-chat TL;DR

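To address the lack of densely annotated full-match data in existing badminton datasets, the paper introduces the first Badminton Full Match Dense (BFMD) dataset and a multimodal captioning framework with semantic feedback. Experiments show that multimodal modeling and semantic feedback improve the quality of generated shot captions.

Abstract (Chinese translation)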

理解羽毛球战术动态需要分析完整比赛而非孤立片段。然而现有羽毛球数据集主要集中于短片段或特定任务标注，鲜少提供带有密集多模态标注的全场比赛数据。这一局限使得生成准确的击球描述和进行比赛级分析变得困难。为突破此限制，我们首次提出羽毛球全场密集标注数据集（BFMD），包含19场转播比赛（含单双打），覆盖超过20小时比赛时长，由1,687个回合和16,751次击球事件构成，每个击球均配有描述性标注。该数据集提供分层标注体系，包括比赛片段、回合事件以及密集的回合级多模态标注（如击球类型、羽毛球轨迹、运动员姿态关键点及击球描述）。我们开发了基于VideoMAE的多模态描述生成框架，引入语义反馈机制，利用击球语义指导描述生成并提升语义一致性。实验结果表明，多模态建模与语义反馈机制相比纯RGB基线显著提升了击球描述质量。我们进一步通过分析全场比赛中战术模式的时序演变，展示了BFMD数据集的应用潜力。

Abstract

Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

Keywords: badminton dataset, full-match analysis, dense annotation, multimodal captioning, VideoMAE, semantic feedback, shot captioning, tactical pattern analysis

187. ❌ Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Authors: Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25527v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

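Scoring rationale: The paper studies the data-quality dilemma in video generation models and a new training method, which is unrelated to the vast majority of the keywords (mostly about large language model techniques). It is only partly related to "Scaling Laws AND Data Quality" (5 points), since it examines how data quality affects model performance but does not address scaling laws themselves.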

!!! tip deepseek-chat TL;DR

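The paper addresses the "Motion-Vision Quality Dilemma" in video generation, where high visual quality and high motion quality are hard to obtain together, and proposes Timestep-aware Quality Decoupling (TQD), a training method based on timestep selection that lets a model train on quality-imbalanced data and still surpass conventional training on better data.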

摘要翻译

视频生成模型近期取得了显著进展，但这些模型高度依赖于同时具备高视觉质量与高运动质量的高质量数据。本文指出了视频数据构建中的一个核心挑战：运动-视觉质量困境。我们发现，视觉质量与运动强度本质上呈现负相关关系，因此难以获得在两方面均表现优异的黄金数据。为应对这一挑战，我们首先研究了视频扩散模型的层次化学习动态，并对质量退化样本进行了基于梯度的分析。我们发现，在适当的时间步，质量不平衡数据能够产生与黄金数据相似的梯度。基于此，我们提出了训练过程中时间步选择的新概念。我们提出时间步感知质量解耦方法，该方法通过调整数据采样分布以更好地匹配模型的学习过程。对于特定类型的数据，运动丰富的数据其采样分布偏向于较高时间步，而高视觉质量数据则更可能在较低时间步被采样。通过大量实验，我们证明TQD能够仅使用分离的不平衡数据进行训练，实现超越传统使用更优数据训练的性能，从而对视频生成中完美数据的必要性提出了挑战。此外，当使用高质量数据训练时，我们的方法也能提升模型性能，展现了其在多种数据场景下的有效性。

摘要 (Abstract)

Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model’s learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.

关键词: video generation, video diffusion models, data quality, motion-vision quality dilemma, timestep selection, training process, imbalanced data, gradient analysis
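The timestep-skewed sampling described in the abstract above can be illustrated with a minimal sketch. This is not the authors' TQD implementation: the Beta-distribution skew, the `NUM_TIMESTEPS` value, the `skew` parameter, and the `"motion"` / `"visual"` labels are all assumptions chosen for illustration. It only shows the stated idea that motion-rich samples are drawn more often at higher timesteps and high-visual-quality samples at lower ones.

```python
import random

NUM_TIMESTEPS = 1000  # hypothetical length of the diffusion schedule


def sample_timestep(kind, rng, skew=3.0):
    """Draw a training timestep biased by the sample's quality type.

    kind: "motion" for motion-rich clips, "visual" for clips with high
          visual quality; any other value falls back to uniform sampling.
    skew: hypothetical knob; larger values bias the draw more strongly.
    """
    if kind == "motion":
        u = rng.betavariate(skew, 1.0)  # mass near 1 -> higher timesteps
    elif kind == "visual":
        u = rng.betavariate(1.0, skew)  # mass near 0 -> lower timesteps
    else:
        u = rng.random()  # uniform baseline
    # Map u in [0, 1] to an integer timestep in [0, NUM_TIMESTEPS - 1].
    return min(int(u * NUM_TIMESTEPS), NUM_TIMESTEPS - 1)
```

With `skew=1.0` both branches reduce to uniform sampling, so the same function also covers the conventional baseline.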

188. ❌ RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

作者: Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, jinghong lan, Shiyu Liu, Yuqi Peng, Gang YU, Shifeng Chen 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25502v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

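Scoring rationale: The paper focuses on image restoration in computer vision using large-scale image editing models, not on large language models or related techniques from natural language processing. All keywords target the technical principles, training methods, application scenarios, or evaluation of large language models and have no direct connection to the paper's visual restoration topic.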

!!! tip deepseek-chat TL;DR

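This work builds a large-scale dataset of real-world degraded images and trains an open-source model on it to address the weak generalization of existing image restoration models, achieving the best performance among open-source methods on a real-world image restoration benchmark.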

摘要翻译

真实场景下的图像复原对于自动驾驶与目标检测等下游任务至关重要。然而，现有复原模型常受限于训练数据的规模与分布，导致其在真实场景中的泛化能力不足。近期，大规模图像编辑模型在复原任务中展现出强大的泛化能力，特别是如Nano Banana Pro这类闭源模型，能够在复原图像的同时保持一致性。然而，利用这些大型通用模型实现同等性能需要大量的数据与计算成本。为解决这一问题，我们构建了一个涵盖九种常见真实退化类型的大规模数据集，并训练了一个先进的开源模型，以缩小与闭源方案之间的差距。此外，我们提出了RealIR-Bench基准测试，其中包含464张真实退化图像，并设计了专注于退化消除与一致性保持的定制化评估指标。大量实验表明，我们的模型在开源方法中位列第一，达到了最先进的性能水平。

摘要 (Abstract)

Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.

关键词: Image Restoration, Real-World Degradations, Large-Scale Image Editing Models, Generalization, Dataset Construction, Open-Source Model, Benchmark Evaluation, Consistency Preservation

189. ❌ Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects

作者: Jakob Paul Zimmermann, Gerrit Holzbach, David Lerch 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25499v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

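Scoring rationale: The paper is a computer vision work on failure prediction for object detectors (KGFP), using embeddings from visual foundation models (such as CLIP) to detect semantic misalignment. Although it touches on the foundation-model concept, all keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, CoT, and so on), and the paper involves no language models, text generation, alignment training, inference optimization, or AI-for-science applications, so every keyword is scored 0 for relevance.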

!!! tip deepseek-chat TL;DR

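The paper proposes Knowledge Guided Failure Prediction (KGFP), which predicts missed detections of safety-critical objects such as pedestrians by measuring the semantic misalignment between an object detector's internal features and visual foundation model embeddings; on COCO person detection it raises person recall among accepted images from 64.3% to 84.5% at a 5% false positive rate.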

摘要翻译

部署在安全关键环境中的目标检测器可能发生静默失效，例如在未发出任何警告的情况下漏检行人、工人或其他安全关键目标。传统的分布外检测方法侧重于识别陌生输入，但无法直接预测检测器本身的功能性失效。本文提出知识引导失效预测框架，这是一种基于表征的监测方法，将漏检的安全关键目标视为运行时可检测的异常。该框架通过双编码器架构与角度距离度量，量化内部目标检测器特征与视觉基础模型嵌入之间的语义错位。其关键特性在于：当检测器在其能力范围外运行，或视觉基础模型自身遇到新异输入时，两种嵌入表征会产生分歧，从而生成高角度信号，可靠地标记不安全图像。我们将这种新颖的KGFP方法与基准分布外检测方法进行比较。在COCO行人检测任务中，将KGFP作为选择性预测门控机制，可使接收图像中的行人召回率在5%误报率下从64.3%提升至84.5%，并在六个COCO-O视觉域中保持强劲性能，显著优于各类分布外检测基线方法。我们的代码、模型及特征已发布于https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP。

摘要 (Abstract)

Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out-of-Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at https://gitlab.cc-asp.fraunhofer.de/iosb_public/KGFP.

关键词: object detection, failure prediction, safety-critical, visual foundation models, semantic misalignment, anomaly detection, selective prediction, out-of-distribution detection
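The angular-distance gate mentioned in the abstract above can be sketched as follows. This is an illustrative reading rather than the published KGFP code: it assumes the two embeddings have already been projected into a shared space, and the function names and the `max_angle` threshold are hypothetical.

```python
import math


def angular_distance(u, v):
    """Angle in radians (0 to pi) between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_sim)


def accept_image(detector_emb, foundation_emb, max_angle):
    """Selective-prediction gate: accept only when the embeddings agree.

    max_angle is a hypothetical threshold that would be tuned on held-out
    data, for example to hit a target false positive rate.
    """
    return angular_distance(detector_emb, foundation_emb) <= max_angle
```

A large angle means the detector's view of the image and the foundation model's view disagree, which is the signal the abstract describes as flagging unsafe images.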

190. ❌ AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

作者: Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25494v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

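Scoring rationale: The paper addresses indoor monocular semantic scene completion (MSSC) in computer vision and proposes AdaSFormer, a transformer-based framework built on an adaptive serialized transformer, center-relative positional encoding, and convolution-modulated layer normalization. The scoring keywords all relate to large language models (LLMs), innovations in deep learning techniques, or AI applications in science, whereas this paper studies 3D scene understanding and does not involve LLMs, MoE, scaling laws, training techniques, reasoning methods, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science, so every keyword is scored 0 for relevance.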

!!! tip deepseek-chat TL;DR

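The paper proposes AdaSFormer, an adaptive serialized transformer framework that tackles the complex spatial layouts and severe occlusions of indoor monocular semantic scene completion, and achieves state-of-the-art performance on the NYUv2 and Occ-ScanNet datasets.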

摘要翻译

室内单目语义场景补全（MSSC）因复杂的空间布局和严重的遮挡问题，其挑战性显著高于室外场景。尽管Transformer模型擅长建模全局依赖关系，但其高内存成本与细粒度细节重建的困难限制了其在室内MSSC中的应用。为解决这些局限，我们提出了AdaSFormer——一种专为室内MSSC设计的序列化Transformer框架。该模型包含三项核心设计：（1）具有可学习偏移量的自适应序列化Transformer，可动态调整感受野；（2）中心相对位置编码，以捕捉丰富的空间信息；（3）卷积调制层归一化，用于桥接卷积特征与Transformer特征之间的异构表示。在NYUv2和Occ-ScanNet数据集上的大量实验表明，AdaSFormer实现了最先进的性能。代码已公开于：https://github.com/alanWXZ/AdaSFormer。

摘要 (Abstract)

Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.

关键词: monocular semantic scene completion, indoor environments, serialized transformer, adaptive receptive fields, center-relative positional encoding, convolution-modulated layer normalization, 3D scene understanding, transformer architecture

191. ❌ GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

作者: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25467v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

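Scoring rationale: The paper studies video anomaly detection and uses Vision-Language Models (VLMs) for spatial reasoning, but all scoring keywords target large language models (LLMs) and related techniques (MoE, Scaling Laws, RLHF, LoRA, and so on). The paper involves no LLM techniques, training methods, inference optimization, or AI-for-science applications, so every keyword is scored 0 for relevance.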

!!! tip deepseek-chat TL;DR

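The paper proposes GridVAD, in which a Vision-Language Model reasons over stratified frame-grid representations to generate anomaly proposals, Self-Consistency Consolidation filters them, and foundation models (Grounding DINO and SAM2) ground and propagate them, giving training-free open-set video anomaly detection; on UCSD Ped2 it achieves the highest Pixel-AUROC (77.59) among the compared methods.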

摘要翻译

视觉-语言模型（VLMs）是强大的开放集推理器，但将其直接用作视频监控中的异常检测器存在脆弱性：未经校准的异常先验知识会导致漏检与幻觉性误报交替出现。我们认为问题不在于VLM本身，而在于其使用方式。VLM应作为异常提议生成器，产生开放集的候选描述，随后由专门构建的空间与时间模块进行定位与追踪。我们在GridVAD中实例化了这一“提议-定位-传播”原则，该训练免费流程无需任何领域特定训练即可生成像素级异常掩码。VLM对视频片段的网格化分层表征进行推理，生成自然语言异常提议。自一致性整合（SCC）通过仅保留在多次独立采样中重复出现的提议来过滤幻觉。Grounding DINO将每个保留的提议锚定为边界框，SAM2则将其作为密集掩码在异常区间内传播。无论视频长度如何，每个片段的VLM调用预算固定为M+1次（M可根据所需提议数量设定）。在UCSD Ped2数据集上，GridVAD在所有对比方法中取得了最高的像素级AUROC（77.59），甚至超过部分微调的TAO方法（75.11），并在物体级RBDC指标上以超过5倍的优势优于其他零样本方法。消融实验表明SCC提供了可控的精确率-召回率权衡：过滤机制以物体级召回率的适度代价提升了所有像素级指标。效率实验显示GridVAD的调用效率比均匀逐帧VLM查询高2.7倍，同时还能生成密集分割掩码。代码与定性视频结果详见https://gridvad.github.io。

摘要 (Abstract)

Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.

关键词: Video Anomaly Detection, Vision-Language Models, Open-set Detection, Spatial Reasoning, Training-free Pipeline, Self-Consistency Consolidation, Pixel-level Segmentation, Zero-shot Learning
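The Self-Consistency Consolidation step described in the abstract above, which keeps only proposals that recur across independent samplings, can be sketched as a voting filter. This is a simplified illustration and not the GridVAD code: matching proposals by normalized exact text is a stand-in assumption (free-form VLM output would likely need semantic matching), and `consolidate` and `min_votes` are hypothetical names.

```python
from collections import Counter


def consolidate(runs, min_votes=2):
    """Keep proposals that appear in at least `min_votes` sampling runs.

    runs: one list of proposal strings per independent VLM sampling.
    """
    votes = Counter()
    for proposals in runs:
        # Use a set so a proposal counts at most once per run.
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)
```

Raising `min_votes` makes the filter stricter, which mirrors the precision-recall tradeoff the abstract attributes to SCC: fewer hallucinated proposals survive, at the cost of dropping some true ones.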

192. ❌ CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

作者: Keming Ye, Zhou Zhao, Fan Wu, Shengyu Zhang 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25463v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

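Scoring rationale: CIAR targets image generation acceleration, using a cloud-device collaboration framework to reduce the latency of auto-regressive models. Two keywords are relevant: (1) "Small Language Models OR SLMs OR On-device AI" (8 points), because the paper explicitly optimizes for on-device deployment, including on-device verification and acceleration; (2) "Speculative Decoding OR Inference Acceleration" (10 points), which is the paper's core contribution: interval-based uncertainty quantification and interval-enhanced decoding to speed up inference. The remaining keywords mainly concern large language models, training methods, alignment, and agents, and are unrelated to the paper's image generation and acceleration topic.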

!!! tip deepseek-chat TL;DR

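The paper proposes the CIAR framework, which uses on-device token uncertainty quantification and interval-enhanced decoding to address the high latency of deploying auto-regressive image generation models on devices, achieving a 2.18x speed-up and 70% fewer cloud requests while preserving image quality.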

摘要翻译

自回归模型近期在图像生成领域取得显著进展，其性能已可与基于扩散的方法相媲美。然而，其计算密集性与序列化特性阻碍了在设备端的部署，导致难以接受的延迟。我们通过云-端协作框架 CIAR 解决这一问题，该框架利用设备端自验证来处理视觉合成的两个关键特性：生成高保真图像所需的庞大词汇表，以及固有的空间冗余性——后者导致同质区域具有极高的可预测性，而物体边界则表现出高度不确定性。均匀验证会在此类冗余词元上浪费资源。我们的解决方案核心在于一个设备端词元不确定性量化器，它采用连续概率区间来加速处理，并使其能够适用于大型视觉词汇表，而非传统的离散解集。此外，我们引入了一个区间增强解码模块，通过分布对齐训练策略，在保持视觉保真度与语义一致性的同时进一步加速解码。大量实验表明，与现有方法相比，CIAR 在保持图像质量的同时，实现了 2.18 倍的加速，并将云端请求减少了 70%。

摘要 (Abstract)

Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and inherent spatial redundancy, which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.

关键词: image generation acceleration, auto-regressive models, on-device deployment, cloud-device collaboration, token uncertainty quantification, interval-based decoding, inference acceleration, visual synthesis
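The idea in the abstract above of spending verification only on uncertain tokens can be illustrated with a small routing sketch. The abstract does not specify how CIAR's interval-based quantifier works, so this example swaps in plain max-probability thresholding as a stand-in; `route_tokens` and the `confidence` parameter are hypothetical.

```python
def route_tokens(token_probs, confidence=0.9):
    """Split token positions into on-device accepts and cloud requests.

    token_probs: one probability distribution (list of floats) per token
                 position, as produced by a hypothetical on-device model.
    Returns (local_positions, cloud_positions).
    """
    local, cloud = [], []
    for pos, probs in enumerate(token_probs):
        if max(probs) >= confidence:
            local.append(pos)  # predictable token, e.g. a flat region
        else:
            cloud.append(pos)  # uncertain token, e.g. an object boundary
    return local, cloud
```

Under this stand-in rule, the share of positions in the second list plays the role of the cloud-request rate that the paper reports reducing.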

193. ❌ DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming

作者: Wei Lian, Fei Ma, Hang Pan, Zhesen Cui, Wangmeng Zuo 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25442v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

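Scoring rationale: The paper studies a globally optimal algorithm for point cloud registration (DC-Reg), which belongs to computer vision and geometry processing. Its core contribution is a tight lower-bounding method based on Difference of Convex (DC) programming for use in branch-and-bound search. The scoring keywords all relate to large models, deep learning techniques, or AI-for-science applications, none of which the paper touches, so every keyword is scored 0 for relevance.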

!!! tip deepseek-chat TL;DR

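The paper proposes DC-Reg, a globally optimal point cloud registration framework that derives tight lower bounds via Difference of Convex (DC) programming to substantially speed up branch-and-bound search, achieving faster convergence and stronger robustness on synthetic data and the 3DMatch benchmark.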

摘要翻译

在部分重叠与大幅错位条件下实现全局最优的点云配准仍是一个基础性挑战。虽然同步估计变换参数（$\boldsymbol{\theta}$）与对应关系（$\mathbf{P}$）的方法对非刚性形变具有鲁棒性，但其非凸耦合目标函数常导致启发式方法陷入局部极小值，而现有全局求解器因下界松弛度过大而收敛时间过长。为此，我们提出DC-Reg，一种鲁棒的全局最优框架，通过显著收紧分支定界（Branch-and-Bound, BnB）搜索来解决该问题。我们的核心创新在于基于凸差规划（Difference of Convex, DC programming）范式，为耦合的变换-匹配目标函数推导出整体性凹下界估计器。与先前依赖逐项松弛（如麦考密克包络）而忽略变量间相互作用的研究不同，我们的整体DC分解捕捉了$\boldsymbol{\theta}$与$\mathbf{P}$之间的联合结构交互。该形式化方法通过在搜索框顶点处求解高效线性分配问题（Linear Assignment Problems, LAP），实现了极紧的下界计算。我们在二维相似变换与三维刚性配准任务上验证了该框架，其中三维配准采用旋转不变特征以在保持最优性的同时实现高效率。在合成数据与3DMatch基准测试上的实验结果表明，相较于现有先进全局优化技术，DC-Reg在收敛速度上显著提升，并对极端噪声和异常值具有更优的鲁棒性。

摘要 (Abstract)

Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.

关键词: point cloud registration, globally optimal, difference of convex programming, branch-and-bound, tight lower bounds, 3D rigid registration, rotation-invariant features, 3DMatch benchmark
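To make the contrast in the abstract above between term-wise relaxation and DC-based bounding concrete, here is the standard textbook treatment of a single bilinear term $xy$ on a box. It is a generic illustration only: the paper's actual decomposition of the coupled $\boldsymbol{\theta}$ and $\mathbf{P}$ objective is not given in the abstract, and the symbols $x$, $y$, $g$, $h$, $\ell$ are introduced here for the example.

```latex
% McCormick envelope (term-wise relaxation) of w = xy
% on the box x \in [x^L, x^U], y \in [y^L, y^U]:
\begin{align*}
w &\ge x^{L} y + x\, y^{L} - x^{L} y^{L}, &
w &\ge x^{U} y + x\, y^{U} - x^{U} y^{U}, \\
w &\le x^{U} y + x\, y^{L} - x^{U} y^{L}, &
w &\le x^{L} y + x\, y^{U} - x^{L} y^{U}.
\end{align*}

% A DC decomposition of the same term (difference of two convex functions):
\[
xy \;=\; \underbrace{\tfrac{1}{4}(x+y)^{2}}_{g(x,y)}
   \;-\; \underbrace{\tfrac{1}{4}(x-y)^{2}}_{h(x,y)} .
\]

% Replacing g by its tangent plane \ell at any point (x_0, y_0) gives a
% concave underestimator, because \ell \le g and -h is concave:
\[
\ell(x,y) = g(x_0,y_0)
  + \nabla g(x_0,y_0)^{\top}
    \begin{pmatrix} x - x_0 \\ y - y_0 \end{pmatrix}
\quad\Longrightarrow\quad
\ell(x,y) - h(x,y) \;\le\; xy .
\]
```

A concave function attains its minimum over a box at one of the box's vertices, which is consistent with the abstract's statement that the lower bounds are evaluated at the vertices of the search boxes.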

194. ❌ HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

作者: Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang 期刊/来源: arxiv 发布日期: 2026-03-26 arXiv链接: http://arxiv.org/abs/2603.25411v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

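Scoring rationale: The paper focuses on 3D spatial understanding in vision-language models (VLMs), an application of large models to a specific domain at the intersection of computer vision and language. The relevance comes from three points: (1) it explicitly trains the model with supervised fine-tuning (SFT) (10 points); (2) it involves spatial reasoning tasks, which relate to multi-step reasoning and in-depth reasoning (5 points each); (3) it falls under large-model applications but does not go deep into LLM techniques (5 points). Other keywords such as MoE, quantization, and RAG are unrelated to the paper (0 points).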

!!! tip deepseek-chat TL;DR

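The paper proposes a hierarchical framework and an automated data generation pipeline, and uses supervised fine-tuning to substantially improve vision-language models on 3D spatial understanding and reasoning tasks, surpassing specialized spatial models and proprietary systems such as Gemini-2.5-pro and GPT-5.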

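Abstract (translated)

Achieving human-like spatial intelligence in vision-language models (VLMs) requires inferring 3D structure from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. This paper proposes a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively more complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we build an automated pipeline that processes more than 45 million objects across roughly 5 million images and generates 3D spatial visual question-answering pairs over diverse tasks and scenes for supervised fine-tuning of VLMs. We also develop an RGB-D VLM that takes metric-scale point maps as auxiliary input to further strengthen spatial understanding. Extensive experiments show that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Our analysis also reveals clear dependencies among the hierarchical task levels, offering new insight into how multi-level task design promotes the emergence of 3D spatial intelligence.

Abstract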

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

Keywords: Vision-Language Models, 3D Spatial Understanding, Hierarchical Framework, Supervised Fine-tuning, Spatial Reasoning, RGB-D VLM, State-of-the-art Performance, Multi-level Task Design

195. ❌ VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Authors: George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25420v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

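Scoring rationale: The paper focuses on video-to-video (V2V) translation, specifically multi-view video translation, for environment randomization of embodied agents. Its core contribution is the VideoWeaver framework, which uses a flow-based single-view model and the Pi3 spatial foundation model to build a shared 4D latent space and achieve multi-view consistency. The paper is unrelated to most keywords (LLMs, MoE, SLMs, Scaling Laws, SFT, RLHF, PEFT, RAG, inference acceleration, quantization, and so on), because these target language models and their training, inference, and alignment techniques. Only two keywords are weakly related: 1) ‘Pre-training OR Continual Pre-training OR Domain Adaptation’: the paper uses the pretrained Pi3 model and a single-view V2V model, touching on foundation models and domain adaptation, but this is not its core focus, so it receives 5 points. 2) ‘World Models AND General World Models’: the paper is applied to world randomization for robot learning, which relates to building environment models but not to general world models, so it receives 5 points. Other keywords such as AI for Science concern scientific applications, but the paper targets robotics rather than bioinformatics or cheminformatics, so they receive 0 points.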
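
The arithmetic behind the score line and table above can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight, with 27 keywords of total weight 27.0) is the one quoted in the report's configuration section. How a single row's score is derived is not stated anywhere in the report, so `row_score` below assumes weight × relevance; that rule, and the names `KeywordRow`, `pass_mark`, `row_score`, and `total_score`, are illustrative assumptions rather than the generator's actual code.

```python
from dataclasses import dataclass


@dataclass
class KeywordRow:
    keyword: str
    weight: float      # "Weight" column
    relevance: float   # "Relevance" column, on a 0-10 scale


def pass_mark(total_weight: float) -> float:
    # Pass-mark formula quoted in the report's configuration section.
    return 5.0 + 0.8 * total_weight


def row_score(row: KeywordRow) -> float:
    # ASSUMPTION: a row's score is weight * relevance; the report does not say.
    return row.weight * row.relevance


def total_score(rows: list[KeywordRow]) -> float:
    return sum(row_score(r) for r in rows)


# The two rows of the table above that have non-zero relevance.
rows = [
    KeywordRow("Pre-training OR Continual Pre-training OR Domain Adaptation", 0.0, 5.0),
    KeywordRow("World Models AND General World Models", 0.0, 5.0),
]
threshold = pass_mark(27.0)   # 27 keywords, total weight 27.0
score = total_score(rows)
passed = score >= threshold
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumption, a row with weight 0.0 contributes nothing regardless of its relevance, which is consistent with the 0.0 totals shown for the two rows rated 5.0/10.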

!!! tip deepseek-chat TL;DR

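The paper addresses view inconsistency in multi-view video-to-video translation. It proposes the VideoWeaver framework, which uses a shared 4D latent space and training at distinct per-view diffusion timesteps to achieve, for the first time, physically and stylistically consistent multi-view translation, improving environment randomization for embodied agents.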

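Abstract (translated)

Recent advances in video-to-video (V2V) translation have enabled realistic resimulation of embodied AI demonstrations, which allows pretrained robot policies to transfer to new environments without additional data collection. However, existing methods handle only one view at a time, whereas embodied AI tasks are usually captured by multiple synchronized cameras to support policy learning. Naively applying a single-view model to each camera independently produces inconsistent appearance across views, and standard Transformer architectures scale poorly to multi-view settings because cross-view attention has quadratic computational cost.
This paper presents VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is first trained as a flow-based single-view V2V model. To extend it to the multi-view regime, we ground all views in a shared 4D latent space derived from Pi3, a feed-forward spatial foundation model. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To go beyond a fixed number of cameras, we train views at distinct diffusion timesteps so that the model learns both joint and conditional view distributions. This in turn enables autoregressive synthesis of new viewpoints conditioned on existing ones.
Experiments show that the framework matches or exceeds the state of the art on single-view translation benchmarks and, for the first time, achieves physically and stylistically consistent multi-view translation, including challenging egocentric and heterogeneous-camera setups that are central to world randomization for robot learning.

Abstract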

Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.

Keywords: Video-to-Video Translation, Multi-view Video, Embodied Agents, World Randomization, Diffusion Models, 4D Latent Space, Pi3 Foundation Model, View Consistency
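
The abstract's idea of training views at distinct diffusion timesteps can be illustrated with a toy noising routine. This is only a sketch under stated assumptions: the linear interpolation between data and Gaussian noise, the latent shapes, and the function name `noise_views` are all hypothetical and not taken from VideoWeaver. The point is that giving each view its own timestep covers both the joint case (all views noised) and the conditional case (some views kept clean as context).

```python
import numpy as np


def noise_views(latents: np.ndarray, t: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Noise each view with its own timestep.

    latents: (num_views, dim) clean per-view latents.
    t:       (num_views,) timesteps in [0, 1]; 0 = clean, 1 = pure noise.
    ASSUMPTION: linear interpolation between data and Gaussian noise.
    """
    eps = rng.standard_normal(latents.shape)
    t = t[:, None]                      # broadcast over the feature axis
    return (1.0 - t) * latents + t * eps


rng = np.random.default_rng(0)
num_views, dim = 3, 4
latents = rng.standard_normal((num_views, dim))

# Joint case: every view gets its own random timestep.
t_joint = rng.uniform(0.0, 1.0, size=num_views)
noisy_joint = noise_views(latents, t_joint, rng)

# Conditional case: view 0 stays clean (t = 0) and acts as context
# while the remaining views are noised.
t_cond = np.array([0.0, 0.7, 0.7])
noisy_cond = noise_views(latents, t_cond, rng)
```

A model trained on samples like `noisy_cond` sees clean views next to noisy ones, which is the setting that conditioning new viewpoints on existing ones would need at inference time.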

196. ❌ LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Authors: Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang, Jialin Tian, Xinyu Zhan, Lifeng Zhu, Cewu Lu, Lixin Yang Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

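Scoring rationale: LaMP focuses on learning vision-language-action policies for robotic manipulation, using 3D scene flow as a latent motion prior. Although it involves deep learning (VLA models), its content is unrelated to all of the scoring keywords, which center on large-model techniques, training methods, inference optimization, alignment, agents, and similar topics. The paper does not mention language models, MoE, scaling laws, training techniques, reasoning methods, agent systems, or AI-for-science applications, so every keyword receives a relevance of 0.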

!!! tip deepseek-chat TL;DR

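The paper proposes LaMP, a framework that uses dense 3D scene flow as a latent motion prior to improve vision-language-action policy learning for robotic manipulation. It outperforms existing baselines on multiple simulation benchmarks and in real-world experiments, with improved robustness.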

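Abstract (translated)

We introduce LaMP, a dual-expert vision-language-action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, which forces them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses the limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step, partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments. Under the same training budgets, LaMP consistently outperforms the evaluated VLA baselines on all three benchmarks and achieves the highest reported average success rates. On LIBERO-Plus out-of-distribution perturbations, LaMP is more robust, with an average gain of 9.7% over the strongest prior baseline. Project page: https://summerwxk.github.io/lamp-project-page/.

Abstract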

We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.

Keywords: Vision-Language-Action, 3D scene flow, latent motion prior, robotic manipulation, dual-expert framework, flow-matching, cross-attention, simulation benchmarks
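
The abstract says the Motion Expert's hidden states condition the Action Expert through gated cross-attention but gives no layer details. The sketch below shows one common form of that idea, a single-head cross-attention update scaled by a tanh gate on a residual path. The single head, the tanh gate, the shapes, and every name here are assumptions chosen for illustration, not LaMP's actual architecture.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(action_h, motion_h, w_q, w_k, w_v, gate):
    """Action tokens attend to motion hidden states; a scalar gate scales the update.

    action_h: (n_action, d) hidden states of the action branch (queries).
    motion_h: (n_motion, d) hidden states of the motion branch (keys/values).
    gate:     scalar; tanh(gate) = 0 leaves action_h unchanged.
    """
    q, k, v = action_h @ w_q, motion_h @ w_k, motion_h @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return action_h + np.tanh(gate) * (attn @ v)


rng = np.random.default_rng(0)
d = 8
action_h = rng.standard_normal((4, d))
motion_h = rng.standard_normal((6, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

closed = gated_cross_attention(action_h, motion_h, w_q, w_k, w_v, gate=0.0)
opened = gated_cross_attention(action_h, motion_h, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the action branch passes through untouched, so a layer of this kind can start as an identity mapping and let the motion signal in gradually as the gate is learned.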

197. ❌ PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Authors: Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25398v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

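Scoring rationale: The paper studies vision foundation models (VFMs) for image and video segmentation and proposes the Plain Mask Transformer (PMT) decoder for efficient segmentation on top of a frozen VFM encoder. Relevance to the scoring keywords: 1) it is only partly related to “Pre-training OR Continual Pre-training OR Domain Adaptation” and “Post-training OR Supervised Fine-tuning OR SFT” (5 points each), because it involves VFM pre-training and adaptation to downstream tasks but does not examine these techniques in depth; 2) the remaining keywords concern language models, reasoning, alignment, compression, and similar topics that the paper does not address, so they receive 0 points. The paper belongs to computer vision and does not involve large language models or the other core techniques in the scoring list.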

!!! tip deepseek-chat TL;DR

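The paper proposes the Plain Mask Transformer (PMT), built around a fast Transformer-based segmentation decoder that performs image and video segmentation on top of a frozen vision foundation model (VFM) encoder. It keeps the encoder shareable while reaching accuracy comparable to fine-tuned methods and running substantially faster at inference.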

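Abstract (translated)

Vision Foundation Models (VFMs) pre-trained at scale allow a single frozen encoder to serve multiple downstream tasks at once. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy at very low latency, but they require fine-tuning the encoder and thus give up the multi-task encoder sharing that makes VFMs attractive for large-scale deployment. To combine the simplicity and speed of encoder-only designs with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates directly on frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), keeps the architectural simplicity and low latency of encoder-only designs while leaving the encoder representation unchanged and shareable. The design applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to about 3x faster. On video segmentation, it performs on par with fully fine-tuned methods while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.

Abstract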

Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.

Keywords: Vision Foundation Models, Image Segmentation, Video Segmentation, Frozen Encoder, Transformer Decoder, Low Latency, Multi-task Sharing, Plain Mask Transformer

198. ❌ FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection

Authors: Yingmei Zhang, Wangtao Bao, Yong Yang, Weiguo Wan, Qin Xiao, Xueting Zou Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

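Scoring rationale: The paper focuses on infrared small target detection (IRSTD), a computer vision task, and proposes FSGNet, a lightweight detection framework that combines frequency-aware and semantic guidance mechanisms. It covers U-Net architecture improvements, a multi-directional interactive attention module, a multi-scale frequency-aware module, and global semantic guidance flows, making it conventional deep learning research in a specific application area (infrared image processing). The scoring keywords all concern large language models (LLMs), large-model technical principles, AI for Science (bioinformatics/cheminformatics), and related topics, none of which the paper touches. It does not discuss large models, language models, model training techniques, inference optimization, AI agents, or AI-for-science applications, so every keyword receives a relevance of 0.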

!!! tip deepseek-chat TL;DR

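The paper targets the semantic degradation that U-Net architectures suffer in infrared small target detection and proposes FSGNet, which uses frequency-aware and semantic guidance mechanisms to achieve more precise target localization and higher detection performance.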

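Abstract (translated)

Infrared small target detection (IRSTD) aims to identify and distinguish small targets against complex backgrounds. The field has made significant progress by building on the strong multi-scale feature fusion of the U-Net architecture. However, U-Net suffers from semantic degradation when passing high-level features from deep layers to shallow ones, which limits precise localization of small targets. To address this, we propose FSGNet, a lightweight and effective detection framework that incorporates frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is used throughout the encoder to capture fine-grained directional features and increase the network's sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module uses the Fast Fourier Transform to filter out target-like clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is upsampled and propagated to every decoder stage through global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets show that FSGNet achieves superior detection performance while remaining efficient, highlighting its practicality and robustness. The code will be released at https://github.com/Wangtao-Bao/FSGNet.

Abstract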

Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network’s sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages Fast Fourier transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The codes will be released on https://github.com/Wangtao-Bao/FSGNet.

Keywords: Infrared small target detection, FSGNet, Frequency-aware, Semantic guidance, U-Net, Multi-directional attention, Fast Fourier transform, Lightweight framework
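
The abstract mentions filtering in the frequency domain with the Fast Fourier Transform to suppress background while keeping small targets. As a rough illustration of that general idea only, the sketch below applies a fixed high-pass mask with NumPy's FFT to a toy map. The fixed circular mask, the `cutoff` parameter, and the function name `fft_high_pass` are assumptions; FSGNet's multi-scale frequency-aware module is presumably learned and is not reproduced here.

```python
import numpy as np


def fft_high_pass(feature: np.ndarray, cutoff: int) -> np.ndarray:
    """Remove spatial frequencies within `cutoff` of DC from a 2-D map."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature))   # DC moved to the centre
    h, w = feature.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= cutoff] = 0.0                     # drop low frequencies
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real


# Toy "infrared" map: a smooth periodic background plus one bright pixel.
h = w = 32
wave = 0.5 + 0.5 * np.cos(2.0 * np.pi * np.arange(w) / w)
image = np.tile(wave, (h, 1))
image[16, 16] += 1.0

filtered = fft_high_pass(image, cutoff=3)
peak = np.unravel_index(np.argmax(filtered), filtered.shape)
```

The smooth background lives entirely in the lowest frequencies and is removed, while the single-pixel target spreads across the whole spectrum and survives almost intact, so `peak` lands on the target location.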

199. ❌ Multimodal Dataset Distillation via Phased Teacher Models

Authors: Shengbin Guo, Hang Zhao, Senqiao Yang, Chenyang Jiang, Yuhang Cheng, Xiangru Peng, Rui Shao, Zhuotao Tian Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25388v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

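Scoring rationale: The paper focuses on multimodal dataset distillation, aiming to compress and transfer knowledge efficiently from large-scale image-text data through a Phased Teacher Model and a Shortcut Trajectory strategy. Its core is optimizing the distillation process to improve the quality of the synthetic dataset and the performance of student models. However, the content has no direct connection to any of the scoring keywords, which center on large language models, deep learning technical principles, AI for Science, and similar topics. The paper does not address LLMs, MoE, SLMs, Scaling Laws, pre-training or post-training techniques, alignment, RLHF, PEFT, RAG, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. Every keyword therefore receives a relevance of 0.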

!!! tip deepseek-chat TL;DR

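The paper proposes PTM-ST, a new phased distillation framework that uses stage-aware teacher modeling and a shortcut-based trajectory construction strategy to address the difficulty of capturing the dynamically evolving knowledge of teacher models in multimodal dataset distillation. It significantly improves distilled-data quality and student performance and surpasses existing methods on Flickr30k and COCO.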

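Abstract (translated)

Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing methods often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models, which degrades student performance and compromises the quality of the distilled data. To address key challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose the Phased Teacher Model with Shortcut Trajectory (PTM-ST), a novel phased distillation framework. PTM-ST uses stage-aware teacher modeling and a shortcut-based trajectory construction strategy to fit the teacher's learning dynamics accurately across distinct training phases, improving both the stability and the expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.

Abstract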

Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) – a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher’s learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.

Keywords: Multimodal Dataset Distillation, Phased Teacher Model, Shortcut Trajectory, Knowledge Transfer, Synthetic Datasets, Teacher-Student Learning, Optimization Oscillations, Cross-stage Performance Gaps

200. ❌ CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Authors: Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyung Sim Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25383v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

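Scoring rationale: The paper focuses on knowledge distillation for CLIP and proposes a relational distillation framework (VRD and XRD) that helps lightweight student models preserve the embedding geometry of the teacher. The scoring keywords all relate to large language models (LLMs), whereas this paper studies knowledge distillation of the vision-language model CLIP, which belongs to computer vision and multimodal learning and has no direct connection to LLM technical principles, training methods, inference optimization, or application scenarios. Every keyword therefore scores 0.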

!!! tip deepseek-chat TL;DR

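The paper observes that existing CLIP knowledge distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings. It proposes CLIP-RD, a relational knowledge distillation framework whose Vertical Relational Distillation and Cross Relational Distillation help the student preserve the teacher's structural relationships, outperforming existing methods by 0.8 percentage points.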

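Abstract (translated)

CLIP aligns image and text embeddings through contrastive learning and shows strong zero-shot generalization. Its large-scale architecture requires substantial compute and memory, which motivates distilling its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model the multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two new methods: Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student's embedding geometry with the teacher's and outperforms existing methods by 0.8 percentage points.

Abstract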

CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student’s ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.

Keywords: CLIP, knowledge distillation, relational distillation, embedding alignment, vertical relational distillation, cross relational distillation, multi-modal learning, model efficiency
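
To make "distilling similarity distributions" concrete, here is a generic sketch that matches a student's image-text similarity distributions to a teacher's with a KL term in both retrieval directions. It is not the paper's VRD or XRD objective, whose exact form the abstract does not give. The temperature value, the embedding sizes, and the name `similarity_kl` are assumptions for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def similarity_kl(t_img, t_txt, s_img, s_txt, temperature=0.1):
    """Mean KL(teacher || student) over image-to-text and text-to-image rows."""
    t_logits = l2_normalize(t_img) @ l2_normalize(t_txt).T / temperature
    s_logits = l2_normalize(s_img) @ l2_normalize(s_txt).T / temperature
    total = 0.0
    for t, s in ((t_logits, s_logits), (t_logits.T, s_logits.T)):
        t_logp, s_logp = log_softmax(t), log_softmax(s)
        total += (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean()
    return total / 2.0


rng = np.random.default_rng(0)
batch = 5
# Teacher and student may use different embedding widths (16 vs 8 here),
# because similarities are computed inside each model's own space.
t_img, t_txt = rng.standard_normal((batch, 16)), rng.standard_normal((batch, 16))
s_img, s_txt = rng.standard_normal((batch, 8)), rng.standard_normal((batch, 8))

loss_random_student = similarity_kl(t_img, t_txt, s_img, s_txt)
loss_perfect_student = similarity_kl(t_img, t_txt, t_img, t_txt)
```

Because the loss compares batch-level similarity structure rather than raw embeddings, a student whose pairwise relations match the teacher's gets zero loss even if its embedding space has a different size.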

201. ❌ InstanceAnimator: Multi-Instance Sketch Video Colorization

Authors: Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin, Qifeng Chen, Anyi Rao, Zeyu Wang Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

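Scoring rationale: InstanceAnimator focuses on video colorization, a computer vision task, and uses a Diffusion Transformer framework for multi-instance sketch video colorization. Although it is an AI application, its content is unrelated to all of the scoring keywords, which center on large language model principles, training methods, inference optimization, alignment techniques, agent systems, and similar topics. The paper does not address language models, MoE, scaling laws, training methods, alignment, inference acceleration, or interpretability, and it is not applied to scientific domains such as bioinformatics.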

!!! tip deepseek-chat TL;DR

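The paper proposes InstanceAnimator, a new Diffusion Transformer-based framework that addresses inflexible user control, inaccurate instance alignment, and low detail fidelity in multi-instance sketch video colorization, producing high-quality multi-instance results.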

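Abstract (translated)

We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods have three core limitations: inflexible user control caused by heavy reliance on a single reference frame, poor instance controllability that leads to misalignment in multi-character scenes, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition removes workflow fragmentation by allowing reference elements and backgrounds to be placed freely, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module improves detail fidelity by injecting semantic features from character, background, and text conditions into the diffusion process. Extensive experiments show that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

Abstract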

We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

Keywords: InstanceAnimator, Diffusion Transformer, sketch video colorization, multi-instance, instance controllability, Canvas Guidance Condition, Instance Matching Mechanism, Adaptive Decoupled Control Module

202. ❌ HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Authors: Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, Sungroh Yoon Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25336v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

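Scoring rationale: The paper studies how sensitive individual attention heads are to sparsification in the Visual Geometry Grounded Transformer (VGGT) for 3D vision, and proposes a Hessian-approximation-based Head Sensitivity Score (HeSS) to guide the redistribution of sparsity. Every scoring keyword targets large language models (LLMs) and related techniques such as fine-tuning, alignment, reasoning, and agents, whereas this paper concerns a computer-vision Transformer (VGGT) and focuses on reducing attention computation through sparsification, so it has no direct connection to the LLM field. The paper does involve model compression and sparsification, but applied specifically to a vision Transformer rather than to LLM quantization or compression.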

!!! tip deepseek-chat TL;DR

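To address the quadratic computational cost of the global attention layers in the Visual Geometry Grounded Transformer (VGGT), the paper proposes a two-stage sparsification method based on the Head Sensitivity Score (HeSS). It quantifies each attention head's sensitivity to sparsification and reallocates the attention budget accordingly, which effectively reduces the performance degradation seen at high sparsity.

Abstract translation (Chinese)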

视觉几何基础变换器（VGGT）推动了三维视觉的发展，但其全局注意力层存在二次计算成本问题，阻碍了模型的可扩展性。已有多种基于稀疏化的加速技术被提出以缓解此问题，但这些方法通常伴随显著的精度下降。我们假设精度下降源于各注意力头对稀疏化敏感度的异质性，因为现有方法对所有注意力头采用了统一的稀疏化模式。基于这一假设，我们提出了一种两阶段稀疏化流程，能有效量化并利用注意力头级别的稀疏化敏感度。在第一阶段，我们使用一种新颖的度量指标——注意力头敏感度分数（HeSS）来量化各注意力头的稀疏化敏感度，该指标通过在小规模校准集上对两个不同误差项近似海森矩阵计算得出。在推理阶段，我们执行HeSS引导的稀疏化，利用预计算的HeSS重新分配总体注意力计算预算——为敏感度高的注意力头分配更密集的注意力计算，而对更鲁棒的注意力头则采用更稀疏的注意力模式。我们证明HeSS能有效捕捉注意力头级别的稀疏化敏感度，并通过实验证实全局注意力层中的注意力头确实表现出异质性的敏感度特征。大量实验进一步表明，我们的方法能有效缓解高稀疏度下的性能下降，在不同稀疏化水平下均展现出强大的鲁棒性。代码发布于https://github.com/libary753/HeSS。

Abstract

Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.

Keywords: Visual Geometry Grounded Transformer, VGGT, attention sparsification, head sensitivity, Hessian approximation, computational efficiency, 3D vision, sparsity redistribution
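
The HeSS-guided budget reallocation summarized in this entry can be illustrated with a small sketch. The abstract does not give the allocation rule, so the code below assumes a simple proportional rule (each head's keep ratio scales with its sensitivity score, then is clipped to a valid range). `reallocate_budget`, `min_keep`, and the sample scores are hypothetical names and values, not taken from the paper or its repository.

```python
def reallocate_budget(scores, mean_keep, min_keep=0.05):
    """Split an attention budget across heads in proportion to sensitivity.

    scores    : one non-negative sensitivity score per head (hypothetical values)
    mean_keep : average fraction of attention entries kept per head (the budget)
    Returns one keep ratio per head, each within [min_keep, 1.0].
    """
    n = len(scores)
    total = sum(scores)
    if n == 0 or total <= 0:
        # No usable scores: fall back to a uniform sparsity pattern.
        return [mean_keep] * n
    # Assumed rule: each head receives a share of the total budget
    # (n * mean_keep) proportional to its sensitivity score.
    keep = [mean_keep * n * s / total for s in scores]
    # Clip to a valid range. Note that clipping can change the total budget;
    # a fuller version would redistribute the clipped amount.
    return [min(1.0, max(min_keep, k)) for k in keep]


if __name__ == "__main__":
    print(reallocate_budget([4.0, 2.0, 1.0, 1.0], mean_keep=0.25))
```

With scores `[4.0, 2.0, 1.0, 1.0]` and an average keep ratio of 0.25, the most sensitive head keeps half of its attention entries while the two least sensitive heads keep one eighth each, and the total budget is unchanged.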

203. ❌ Adaptive Learned Image Compression with Graph Neural Networks

Authors: Yunuo Chen, Bing He, Zezheng Lyu, Hongwei Hu, Qunshan Gu, Yuan Tian, Guo Lu Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25316v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

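Scoring rationale: The paper focuses on image compression and proposes an adaptive learned image compression method based on Graph Neural Networks (GNNs). Although it is a deep-learning application, its content is unrelated to every scoring keyword, all of which center on topics such as large language models, alignment, reasoning, agents, and AI for science. The paper covers no language models, MoE, scaling laws, training techniques, inference optimization, agent systems, or AI-for-science applications.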

!!! tip deepseek-chat TL;DR

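The paper proposes GLIC, an adaptive learned image compression framework based on graph neural networks. It models image redundancy flexibly by building dual-scale graphs and dynamically adjusting node connectivity, and it outperforms existing methods on several datasets.

Abstract translation (Chinese)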

高效的图像压缩依赖于对局部与全局冗余的联合建模。当前大多数先进的基于学习的图像压缩方法均基于卷积神经网络或Transformer架构，这些方法本质上具有结构刚性。标准卷积核与基于窗口的注意力机制采用固定的感受野和静态连接模式，可能仅因像素在欧氏空间中的邻近性而耦合非冗余像素。这种刚性限制了模型自适应捕捉图像中空间变化冗余（尤其是在全局层面）的能力。为克服这些局限，我们提出一种基于图神经网络的内容自适应图像压缩框架。具体而言，该方法构建双尺度图以实现灵活、数据驱动的感受野。此外，我们通过依据局部内容复杂度动态调整每个节点的邻居数量，引入了自适应连接机制。这些创新使得我们提出的基于图的学习图像压缩模型能够有效建模图像中多样化的冗余模式，从而实现更高效、自适应的压缩。实验表明，该模型在Kodak、Tecnick和CLIC数据集上相较于VTM-9.1分别实现了19.29%、21.69%和18.71%的BD-rate压缩性能提升，达到了当前最优性能。代码将在https://github.com/UnoC-727/GLIC发布。

Abstract

Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model’s ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at https://github.com/UnoC-727/GLIC.

Keywords: Learned Image Compression, Graph Neural Networks, Adaptive Compression, Dual-scale Graphs, Content-adaptive, Receptive Fields, BD-rate Reduction, State-of-the-art Performance
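
The adaptive connectivity idea in this entry (more neighbors where local content is complex) can be sketched in a few lines. This is only an illustration under stated assumptions: the linear mapping from complexity to neighbor count, the function name `adaptive_knn_graph`, and the `k_min`/`k_max` parameters are hypothetical and are not described in the abstract.

```python
import math


def adaptive_knn_graph(features, complexity, k_min=1, k_max=4):
    """Build a directed kNN graph whose per-node degree grows with complexity.

    features   : list of equal-length feature vectors, one per node
    complexity : list of values in [0, 1], one per node (hypothetical measure)
    Returns {node index: list of neighbor indices, nearest first}.
    """
    n = len(features)
    graph = {}
    for i in range(n):
        # Assumed rule: map complexity linearly onto [k_min, k_max].
        k = k_min + round(complexity[i] * (k_max - k_min))
        k = min(k, n - 1)
        # Rank all other nodes by Euclidean distance in feature space,
        # so neighbors are chosen by content rather than by pixel position.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )
        graph[i] = others[:k]
    return graph


if __name__ == "__main__":
    feats = [[0.0], [1.0], [2.0], [10.0]]
    print(adaptive_knn_graph(feats, [0.0, 1.0, 0.0, 0.0], k_min=1, k_max=3))
```

In the usage example, only the second node is marked as complex, so it connects to all three other nodes while the remaining nodes keep a single nearest neighbor.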

204. ❌ MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Authors: Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25319v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

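Scoring rationale: The paper focuses on multi-reference image generation. It builds a large-scale dataset (MacroData) and a benchmark (MacroBench) to address the performance drop that existing models show as the number of input reference images grows. Its core is image generation in computer vision, not large language models (LLMs) or new deep-learning techniques. The 'fine-tuning' mentioned in the abstract is partly related to the keyword 'Post-training OR Supervised Fine-tuning OR SFT', because the paper fine-tunes models to improve multi-reference generation; however, fine-tuning is not the core contribution (the dataset and benchmark are), so it is given 5 points. All other keywords are unrelated to the paper, which does not address LLMs, MoE, scaling laws, alignment, RAG, reasoning methods, agents, model compression, or AI for Science.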

!!! tip deepseek-chat TL;DR

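The paper addresses the performance drop of multi-reference image generation models as the number of input references increases. It builds a large-scale structured long-context dataset (MacroData) and a standardized evaluation benchmark (MacroBench), and shows that fine-tuning on the dataset significantly improves generation performance.

Abstract translation (Chinese)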

基于多张视觉参考图像生成图像对于多主体合成、叙事插画及新视角合成等实际应用至关重要，然而当前模型在输入参考图像数量增加时会出现严重的性能退化。我们发现其根本原因在于一个根本性的数据瓶颈：现有数据集主要由单参考或少量参考图像对主导，缺乏用于学习密集参考间依赖关系所需的结构化、长上下文监督。为解决这一问题，我们引入了MacroData，这是一个包含40万个样本的大规模数据集，每个样本最多包含10张参考图像，并系统性地按照四个互补维度——定制化、插画、空间推理与时间动态——进行组织，以全面覆盖多参考生成的任务空间。鉴于当前同时缺乏标准化的评估方案，我们进一步提出了MacroBench，这是一个包含4000个样本的基准测试，用于评估模型在分级任务维度与不同输入规模下的生成连贯性。大量实验表明，在MacroData上进行微调能显著提升多参考生成性能，消融研究进一步揭示了跨任务协同训练的协同效益以及处理长上下文复杂性的有效策略。本数据集与基准测试将公开发布。

Abstract

Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions – Customization, Illustration, Spatial reasoning, and Temporal dynamics – to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

Keywords: multi-reference image generation, structured long-context data, dataset construction, benchmark evaluation, fine-tuning, generative coherence, ablation studies, cross-task co-training

205. ❌ V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

Authors: Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25275v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

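Scoring rationale: The paper focuses on creating a cooperative-perception dataset and benchmarks for autonomous driving, covering multi-modal data collection between a vehicle and a UAV (V2U), 3D object detection, and tracking. All scoring keywords relate directly to large models, deep-learning technical principles, or applications of AI in science, whereas this work belongs to computer vision and autonomous-driving perception and has no direct connection to them, so every keyword receives a relevance of 0.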

!!! tip deepseek-chat TL;DR

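The paper presents V2U4Real, the first large-scale real-world dataset for vehicle-to-UAV cooperative perception, establishes 3D object detection and tracking benchmarks, and shows that V2U cooperation effectively improves perception robustness and long-range perception.

Abstract translation (Chinese)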

现代自动驾驶感知系统常受限于遮挡、盲区及有限的传感范围。现有的协同感知范式（如车-车协同（V2V）与车-路协同（V2I））虽已证明能有效缓解这些问题，但仍局限于地面层面的协作，无法完全解决复杂环境中的大规模遮挡或远距离感知难题。为推进跨视角协同感知研究，我们提出了V2U4Real——首个面向车-无人机协同（Vehicle-to-UAV, V2U）目标感知的大规模真实世界多模态数据集。该数据集由搭载多视角激光雷达（LiDAR）与RGB相机的地面车辆和无人机协同采集，覆盖城市街道、大学校园和乡村道路等多种交通场景，包含超过5.6万帧激光雷达点云、5.6万幅多视角相机图像以及涵盖四个类别的70万个标注3D边界框。为支持广泛的研究任务，我们建立了单智能体3D目标检测、协同3D目标检测与目标追踪的基准测试。通过对多种前沿模型的综合评估，验证了车-无人机协同在提升感知鲁棒性与远距离感知能力方面的有效性。V2U4Real数据集与代码库已发布于https://github.com/VjiaLi/V2U4Real。

Abstract

Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase are available at https://github.com/VjiaLi/V2U4Real.

Keywords: Vehicle-to-UAV cooperative perception, large-scale dataset, multi-modal data, 3D object detection, object tracking, autonomous vehicles, LiDAR, RGB cameras

206. ❌ Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

Authors: Hongru Han, Tingrui Guo, Liming Zhang, Yan Su, Qiwen Xu, Zhuohua Ye Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25296v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

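Scoring rationale: The paper focuses on low-light image enhancement in computer vision. It proposes a controllable enhancement framework, CLE-RWKV, and a new dataset, Light100, and uses State Space Models (SSMs) with a Space-to-Depth (S2D) strategy. All scoring keywords relate directly to large language models, deep-learning technical principles, or applications of AI in science, whereas this paper studies a traditional image-processing problem and involves no large-model techniques, novel deep-learning principles, or applications of AI to sciences such as biology or chemistry, so every keyword receives a relevance of 0.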

!!! tip deepseek-chat TL;DR

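To address the ill-posed nature of low-light image enhancement, the paper proposes CLE-RWKV, a controllable low-light enhancement framework. By introducing Light100, a dataset of continuous real-world illumination, together with a noise-decoupled supervision strategy, it achieves effective control over output luminance and competitive performance on multiple benchmarks.

Abstract translation (Chinese)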

低光照图像增强传统上被构建为一种确定性映射任务。然而，该范式往往难以应对任务本身的不适定性——未知的环境条件与传感器参数构成了一个多模态解空间。因此，现有先进方法常面临预测结果与真实标签间的亮度差异，通常需要依赖“gt-mean”后处理来对齐输出亮度以进行评估。为克服这一根本性局限，我们提出向可控低光照增强范式转变，明确地将该任务重新构建为一个适定的条件性问题。为此，我们引入了CLE-RWKV这一整体框架，并辅以Light100——一个包含连续真实世界光照变化的新基准数据集。为解决亮度控制与色彩保真度之间的冲突，我们在HVI色彩空间中采用了噪声解耦监督策略，有效分离了光照调制与纹理恢复过程。在架构设计上，为使高效的状态空间模型适应密集预测任务，我们利用了空间到深度策略。通过将空间邻域折叠至通道维度，该设计使模型能够恢复局部归纳偏置，并有效弥合扁平化视觉序列中固有的“扫描间隙”，同时保持线性复杂度。在七个基准数据集上的实验表明，我们的方法实现了具有竞争力的性能与鲁棒的可控性，提供了一个真实世界的多光照替代方案，显著降低了对gt-mean后处理的依赖。

Abstract

Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating “gt-mean” post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the “scanning gap” inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.

Keywords: Low-light image enhancement, Controllable enhancement, State Space Models, Continuous illumination, HVI color space, Space-to-Depth, Multi-illumination dataset, Luminance control
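
The Space-to-Depth (S2D) step mentioned in this entry, folding spatial neighborhoods into channel dimensions, is a standard tensor rearrangement and can be shown on plain nested lists. The sketch below is a generic illustration, not the authors' implementation; the function name `space_to_depth` and the channel ordering (row-major within each block) are assumptions, and frameworks differ on that ordering.

```python
def space_to_depth(x, block=2):
    """Fold each block x block spatial neighborhood into the channel axis.

    x : nested list shaped [H][W][C], with H and W divisible by `block`
    Returns a nested list shaped [H/block][W/block][C * block * block].
    """
    h, w = len(x), len(x[0])
    if h % block or w % block:
        raise ValueError("H and W must be divisible by block")
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            # Concatenate the channels of every pixel in this neighborhood.
            for di in range(block):
                for dj in range(block):
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[[0], [1], [2], [3]],
             [[4], [5], [6], [7]]]  # H=2, W=4, C=1
    print(space_to_depth(image))
```

After folding, a flattened scan visits four times fewer positions, and each position already carries its 2x2 neighborhood, which is one way to read the abstract's remark about recovering local inductive biases.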

207. ❌ EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

Authors: Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25267v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

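Scoring rationale: The paper focuses on text-video retrieval and proposes EagleNet, a model that improves text embeddings through fine-grained relationship learning and energy-aware matching. Although it mentions large-scale vision-language pre-trained models as background, its core content involves none of the large-model technical principles, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications named in the scoring keywords. No keyword is related to the paper's specific technical contributions.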

!!! tip deepseek-chat TL;DR

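To address the mismatch between text expressions and video semantics in text-video retrieval, the paper proposes EagleNet, which generates context-aware text embeddings through fine-grained relationship learning and energy-aware matching and achieves superior performance on several benchmark datasets.

Abstract translation (Chinese)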

得益于大规模视觉语言预训练模型的最新发展，文本-视频检索任务取得了显著进步。传统方法主要关注视频表征或跨模态对齐，而近期研究则转向增强文本表达能力以更好地匹配视频中丰富的语义信息。然而，这些方法仅利用文本与帧/视频之间的交互，忽略了视频内部帧间丰富的相互作用，导致最终扩展的文本无法捕捉帧间上下文信息，造成文本与视频之间的语义差异。为此，我们提出能量感知细粒度关系学习网络（EagleNet），以生成准确且具有上下文感知的增强文本嵌入。具体而言，所提出的细粒度关系学习机制首先通过生成的文本候选帧与视频帧构建文本-帧图，进而学习文本与帧之间的关系，最终利用这些关系将文本候选帧聚合为包含帧上下文信息的增强文本嵌入。为进一步优化细粒度关系学习，我们设计了能量感知匹配模块，通过建模文本-帧交互的能量来精确捕捉真实文本-视频对的分布特征。此外，为实现更有效的跨模态对齐和稳定训练，我们采用基于sigmoid的损失函数替代传统的基于softmax的对比损失。在MSRVTT、DiDeMo、MSVD和VATEX数据集上的大量实验证明了EagleNet的优越性。代码发布于https://github.com/draym28/EagleNet。

Abstract

Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at https://github.com/draym28/EagleNet.

Keywords: text-video retrieval, vision-language pre-trained models, fine-grained relationship learning, energy-aware matching, cross-modal alignment, context-aware text embeddings, contrastive loss, MSRVTT
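
This entry mentions replacing a softmax-based contrastive loss with a sigmoid loss. The abstract does not state the exact form, so the sketch below assumes the pairwise sigmoid formulation used in SigLIP-style training, shown next to a one-directional softmax (InfoNCE-style) loss for comparison. The function names, the temperature `t`, and the bias `b` are hypothetical choices for illustration.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_pair_loss(sim, t=10.0, b=-5.0):
    """Pairwise sigmoid loss over an N x N text-video similarity matrix.

    Each (i, j) pair is an independent binary decision: matched pairs
    (i == j) are labeled +1 and every other pair is labeled -1.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total += log_sigmoid(z * (t * sim[i][j] + b))
    return -total / n


def softmax_contrastive_loss(sim, t=10.0):
    """Text-to-video softmax contrastive loss: each row is normalized jointly."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [t * s for s in sim[i]]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += logits[i] - log_z
    return -total / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(sigmoid_pair_loss(aligned), sigmoid_pair_loss(swapped))
    print(softmax_contrastive_loss(aligned), softmax_contrastive_loss(swapped))
```

The practical difference is that the sigmoid version scores every pair on its own, with no normalization across the batch, while the softmax version couples all candidates in a row through the shared denominator.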

208. ❌ ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis

Authors: Moonyeon Jeong, Seunggi Min, Suhyeon Lee, Hongje Seong Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25265v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

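Scoring rationale: ViewSplat focuses on 3D scene reconstruction and novel view synthesis in computer vision using 3D Gaussian splatting. It is unrelated to every scoring keyword, all of which concern large models, deep-learning technical principles, or applications of AI in science, so every keyword receives a relevance of 0.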

!!! tip deepseek-chat TL;DR

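The paper proposes ViewSplat, a view-adaptive dynamic 3D Gaussian splatting network for novel view synthesis from unposed images. By shifting from static primitive regression to view-adaptive dynamic splatting, it achieves state-of-the-art fidelity while maintaining fast inference and real-time rendering.

Abstract translation (Chinese)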

我们提出ViewSplat，一种基于未标定图像进行新视角合成的视角自适应三维高斯溅射网络。尽管近期前馈式三维高斯溅射方法通过绕过逐场景优化显著加速了三维场景重建，但其在保真度上仍存在根本性差距。我们认为这一瓶颈源于单步前馈网络在回归满足所有视角需求的静态高斯基元方面能力有限。为突破此限制，我们将范式从静态基元回归转向视角自适应动态溅射。我们的流程不再采用刚性高斯表示，而是学习一种视角自适应的潜在表示。具体而言，ViewSplat首先预测基础高斯基元及动态多层感知机（MLPs）的权重。在渲染过程中，这些MLPs以目标视角坐标为输入，为每个高斯属性（即三维位置、尺度、旋转、不透明度与颜色）预测视角依赖的残差更新。这一被我们称为视角自适应动态溅射的机制，使每个基元能够修正初始估计误差，从而有效捕捉高保真外观。大量实验表明，ViewSplat在保持快速推理（17 FPS）与实时渲染（154 FPS）能力的同时，实现了业界领先的保真度。

Abstract

We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).

Keywords: ViewSplat, view-adaptive, 3D Gaussian splatting, novel view synthesis, feed-forward network, dynamic MLPs, real-time rendering, scene reconstruction
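
The view-dependent residual update described in this entry (a small MLP takes the target view coordinates and corrects a base Gaussian attribute) can be sketched as follows. This is a toy illustration only: the single hidden layer, the `tanh` activation, the weight shapes, and the names `mlp_residual` and `render_attribute` are assumptions, since the abstract does not specify the MLP architecture.

```python
import math


def mlp_residual(view, w1, b1, w2, b2):
    """One-hidden-layer MLP mapping view coordinates to an attribute residual."""
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, view)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(w2, b2)
    ]


def render_attribute(base, view, w1, b1, w2, b2):
    """Base attribute plus a view-dependent residual (e.g. a 3D position)."""
    residual = mlp_residual(view, w1, b1, w2, b2)
    return [a + r for a, r in zip(base, residual)]


if __name__ == "__main__":
    base_position = [1.0, 2.0, 3.0]
    # Hypothetical weights; in the paper's setup these would be predicted
    # by the network together with the base primitives.
    w1, b1 = [[1.0, 0.0, 0.0]], [0.0]
    w2, b2 = [[0.1], [0.0], [0.0]], [0.0, 0.0, 0.0]
    print(render_attribute(base_position, [0.0, 0.0, 0.0], w1, b1, w2, b2))
    print(render_attribute(base_position, [1.0, 0.0, 0.0], w1, b1, w2, b2))
```

The same base primitive yields slightly different attributes for different target views, which is the sense in which the primitive can correct its initial estimate per view.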

209. ❌ Towards Practical Lossless Neural Compression for LiDAR Point Clouds

Authors: Pengpeng Yu, Haoran Li, Runqing Jiang, Dingquan Li, Jing Wang, Liang Lin, Yulan Guo Journal/source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25260v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

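Scoring rationale: The paper focuses on lossless neural compression of LiDAR point clouds. It proposes a compact representation and two lightweight modules (the Geometry Re-Densification Module and the Cross-scale Feature Propagation Module) and introduces an integer-only inference pipeline. All scoring keywords concern large language models, core deep learning techniques, or AI for Science, whereas this paper studies point cloud compression in computer vision, a conventional deep learning application. It touches none of the large-model techniques, AI for Science topics, or other specific techniques named in the keywords, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inefficient context modeling caused by the extreme sparsity of high-precision geometric details in LiDAR point cloud compression. It proposes a framework built on a compact representation and lightweight modules, and reports competitive compression performance at real-time speed.

Abstract (Chinese translation)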

激光雷达点云是众多应用的基础，然而高精度几何细节的极端稀疏性阻碍了高效的上下文建模，从而限制了现有方法的压缩速度与性能。为解决这一挑战，我们提出了一种紧凑表示方法以实现高效的无损预测编码。我们的框架包含两个轻量级模块。首先，几何重致密化模块迭代地对已编码的稀疏几何进行致密化，在稠密尺度上提取特征，随后将特征稀疏化以用于预测编码。该模块避免了在高度稀疏细节上进行昂贵计算，同时保持了轻量级的预测头。其次，跨尺度特征传播模块利用来自多分辨率层次的占用线索来引导分层特征传播，实现跨尺度信息共享并减少冗余特征提取。此外，我们引入了纯整数推理流程，以实现跨平台的比特级精确一致性，这避免了现有神经压缩方法中观察到的熵编码崩溃问题，并进一步加速了编码过程。实验表明，该方法在实时速度下取得了具有竞争力的压缩性能。代码将在论文录用后公开。代码地址：https://github.com/pengpeng-yu/FastPCC。

Abstract

LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code will be released upon acceptance. Code is available at https://github.com/pengpeng-yu/FastPCC.

Keywords: LiDAR point clouds, lossless compression, neural compression, geometry re-densification, cross-scale feature propagation, integer-only inference, real-time speed, predictive coding
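
The motivation for integer-only inference can be illustrated with a small experiment. Floating-point addition is not associative, so two platforms that accumulate the same values in a different order can end up with slightly different probabilities, and an entropy decoder that disagrees with the encoder by even one bit loses synchronization. The sketch below illustrates that general issue only; it is not the paper's pipeline, and the fixed-point scale and the helper name `to_fixed` are assumptions.

```python
SCALE = 1 << 16  # assumed fixed-point scale (16 fractional bits)


def to_fixed(x, scale=SCALE):
    """Quantize a float to an integer with a fixed number of fractional bits."""
    return round(x * scale)


values = [0.1, 0.2, 0.3]

# Floating point: the result depends on the order of accumulation.
float_left = (values[0] + values[1]) + values[2]
float_right = values[0] + (values[1] + values[2])
float_consistent = float_left == float_right

# Integers: addition is exact, so any accumulation order agrees bit for bit.
fixed = [to_fixed(v) for v in values]
int_left = (fixed[0] + fixed[1]) + fixed[2]
int_right = fixed[0] + (fixed[1] + fixed[2])
int_consistent = int_left == int_right
```

Here `float_consistent` is false while `int_consistent` is true, which is the property a bit-exact cross-platform codec needs.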

210. ❌ Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection

Authors: Md Awsafur Rahman, Chandrakanth Gudavalli, Hardik Prajapati, B. S. Manjunath Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25255v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

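Scoring rationale: The paper focuses on trajectory anomaly detection. It proposes a computer vision approach that represents trajectories as a Hyperspectral Trajectory Image (HTI) and models them with a Cyclic Factorized Transformer (CFT). Its core contributions are trajectory data processing, an image representation, and an efficient Transformer architecture. It does not involve large language models (LLMs), the deep learning techniques targeted by the keywords (such as MoE, scaling laws, training and alignment methods, inference optimization, or agents), or domain-specific AI for Science applications (such as bioinformatics). All scoring keywords relate directly to large models or the specified AI for Science subfields, whereas the topic of this paper (spatiotemporal trajectory analysis) is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper tackles multi-month anomaly detection on dense GPS trajectories. It proposes TITAnD, which represents trajectories as hyperspectral images and models them with a Cyclic Factorized Transformer, achieving the best AUC-PR on both sparse and dense benchmarks while running 11-75x faster than the Transformer with comparable memory.

Abstract (Chinese translation)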

轨迹异常检测是欺诈检测到城市移动分析等应用的基础。密集GPS方法保留了细粒度证据（如异常速度和短时事件），但其二次方计算成本使得多月分析难以实现；因此，现有方法均无法检测多月密集GPS轨迹中的异常。该领域转而依赖可扩展的稀疏停留点方法，但这些方法丢弃了上述证据，导致不同数据模态需采用独立架构且阻碍了知识迁移。我们认为这一瓶颈并非必然：人类轨迹无论密集或稀疏，均共享沿日内与日间轴线的天然二维循环结构。为此，我们提出TITAnD（用于异常检测的轨迹图像Transformer），通过将轨迹表示为高光谱轨迹图像（Hyperspectral Trajectory Image, HTI）——一种以“日期×日内时间”为网格、其通道编码来自任意模态的空间、语义、时间和运动学信息的表示形式，从而将轨迹异常检测重构为视觉问题，在单一表征下统一两种模态。在此框架下，个体级异常检测简化为图像分类，而时序定位则转化为语义分割。为建模该表征，我们引入循环因子化Transformer（Cyclic Factorized Transformer, CFT），其沿两个时间轴分解注意力机制，编码人类日常行为的循环归纳偏置，同时将注意力计算成本降低数个数量级，首次实现了密集多月轨迹的异常检测。实验表明，TITAnD在稀疏与密集基准测试中均取得最优的AUC-PR，超越UNet等视觉模型，同时比同等内存占用的Transformer快11-75倍，证明视觉重构与结构感知建模具有协同必要性。代码即将公开。

Abstract

Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.

Keywords: Trajectory Anomaly Detection, Hyperspectral Trajectory Image (HTI), Cyclic Factorized Transformer (CFT), Multi-month Dense GPS, Vision Reformulation, Temporal Localization, Attention Cost Reduction, Human Routine Modeling
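
The day x time-of-day grid and the saving from factorizing attention along its two axes can be sketched as follows. This is a rough illustration under stated assumptions, not the TITAnD code: the two-channel (lat, lon) cell, the hourly slot size, the overwrite rule, and the names `build_hti` and `attention_pairs` are all hypothetical, and the cost model simply assumes each token attends to its own row and its own column.

```python
from datetime import date, datetime


def build_hti(points, start_day, num_days, slots_per_day=24):
    """Place (timestamp, lat, lon) points on a day x time-of-day grid.

    Each cell holds a (lat, lon) channel tuple, or None when no point falls
    in that slot. Later points overwrite earlier ones in the same cell.
    Assumes slots_per_day divides 86400 evenly.
    """
    slot_seconds = 86400 // slots_per_day
    grid = [[None] * slots_per_day for _ in range(num_days)]
    for ts, lat, lon in points:
        day = (ts.date() - start_day).days
        if not 0 <= day < num_days:
            continue  # outside the observation window
        seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
        grid[day][seconds // slot_seconds] = (lat, lon)
    return grid


def attention_pairs(num_days, slots_per_day):
    """Token pairs scored by full attention vs. attention along each axis."""
    tokens = num_days * slots_per_day
    full = tokens * tokens
    factorized = tokens * (num_days + slots_per_day)
    return full, factorized


points = [
    (datetime(2026, 1, 1, 0, 30), 34.41, -119.85),
    (datetime(2026, 1, 2, 23, 59), 34.42, -119.70),
]
grid = build_hti(points, date(2026, 1, 1), num_days=2)
full, factorized = attention_pairs(num_days=90, slots_per_day=288)
```

For 90 days at 5-minute resolution, this toy cost model gives 671,846,400 pairs for full attention against 9,797,760 for the factorized form, roughly 69 times fewer.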

211. ❌ Efficient Preemptive Robustification with Image Sharpening

Authors: Jiaming Liang, Chi-Man Pun Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

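Scoring rationale: The paper studies adversarial robustness of images in computer vision and proposes a pre-attack defense based on image sharpening. All scoring keywords concern large language models, innovations in core deep learning techniques, or AI for Science, whereas this paper focuses on adversarial defense in image processing. It involves no large-model techniques, training methods, inference optimization, agent systems, or AI for Science applications, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes an efficient preemptive robustification method based on image sharpening. By strengthening the texture intensity of benign images, it makes them harder to attack with adversarial perturbations against deep neural networks, with notable gains in transfer scenarios and low computational cost.

Abstract (Chinese translation)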

尽管深度神经网络取得了巨大成功，其依赖高维、非鲁棒性表征的特性使其对微小扰动极为脆弱，即使在迁移场景中也不例外。为解决这一问题，训练时防御（如对抗性训练与鲁棒架构设计）与攻击后防御（如输入净化与对抗性检测）均已得到广泛研究。近期，少量研究初步探索了一种攻击前防御范式，称为先发鲁棒化，即在攻击前对良性样本引入细微修改以主动抵抗对抗性扰动。然而，由于存在若干局限，其实用性仍存疑，包括：（1）依赖训练良好的分类器作为代理以提供鲁棒性先验；（2）因迭代优化或训练生成器进行鲁棒化而产生巨大计算开销；（3）基于优化或生成的鲁棒化过程可解释性有限。受近期研究揭示纹理强度与良性样本鲁棒性正相关的启发，我们发现仅通过图像锐化即可高效实现图像鲁棒化。据我们所知，这是首个无需代理、无需优化、无需生成器且具备人类可解释性的鲁棒化方法。大量实验表明，锐化能以较低计算成本显著提升鲁棒性，尤其在迁移场景中表现突出。

Abstract

Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.

Keywords: preemptive robustification, image sharpening, adversarial robustness, deep neural networks, transfer scenarios, computational efficiency, human-interpretable
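
To make "image sharpening alone" concrete, here is a minimal unsharp-masking sketch on a grayscale image stored as nested lists. The abstract does not say which sharpening operator or strength the authors use, so the 3x3 box blur, the `amount` parameter, and the function name `sharpen` are assumptions chosen for illustration.

```python
def sharpen(image, amount=1.0):
    """Unsharp masking: add back the difference between a pixel and its blur.

    `image` is a 2D list of floats in [0, 1]. The blur is a 3x3 box filter
    with edge clamping; results are clipped to [0, 1].
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            blurred = total / 9.0
            value = image[y][x] + amount * (image[y][x] - blurred)
            out[y][x] = min(1.0, max(0.0, value))
    return out


flat = [[0.5] * 3 for _ in range(3)]
spot = [[0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0]]
sharp_flat = sharpen(flat)   # flat regions are left unchanged
sharp_spot = sharpen(spot)   # the bright spot gains contrast
```

The operation needs no surrogate classifier, optimization loop, or generator, which matches the properties the abstract emphasizes.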

212. ❌ Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Authors: Qingfeng Li, Haoxian Zhang, Xu He, Songlin Tang, Zhixue Fang, Xiaoqiang Liu, Pengfei Wan, Guoqi Li Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

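Scoring rationale: The paper focuses on image generation in computer vision and proposes a semantic-aware visual tokenizer (SMAP) and a hybrid causal autoregressive-diffusion generator (CARD). Although it involves deep learning techniques, the keywords all target large language models (LLMs) and related techniques (such as fine-tuning, alignment, reasoning, and agents), whereas this paper studies visual tokenizers and image generation models and has no direct connection to text-based large models. The only cross-domain keyword, "AI for Science", calls for bioinformatics or cheminformatics applications, and ImageNet image generation falls outside those subfields. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the weak semantic alignment of existing visual tokenizers, the paper proposes a semantic-aware prefix tokenizer (SMAP) and a hybrid causal autoregressive-diffusion generator (CARD), and reports improved reconstruction quality and downstream generation performance on ImageNet.

Abstract (Chinese translation)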

视觉分词器通过桥接高维图像与可处理的生成建模，在潜在图像生成中发挥着核心作用。然而，现有的大多数分词器仍以重建为主导目标进行训练，这通常产生的潜在表征仅与高层语义存在弱关联。近期方法虽提升了语义对齐能力，但通常仅将语义信号视为辅助正则化手段，而未使其在表征学习中发挥功能性作用。我们提出SMAP，一种语义感知前缀分词器，它将类别级语义条件注入基于查询的一维分词框架。为使语义在训练过程中不可或缺，SMAP引入了尾部令牌丢弃策略，该策略迫使语义条件与早期潜在前缀在逐渐缩减的令牌预算下承担递增的责任。为验证所得潜在空间适用于生成任务而非仅用于重建，我们进一步提出CARD，一种混合因果自回归-扩散生成器。在ImageNet上的大量实验表明，SMAP在离散与连续分词设置下均能持续提升重建质量，且其基于语义的潜在空间在紧凑令牌预算下展现出强大的下游生成性能。

Abstract

Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive–Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.

Keywords: visual tokenizer, semantic-aware prefix, image generation, latent representation, causal autoregressive-diffusion, token-efficient, semantic alignment, ImageNet
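
The tail token dropping idea can be sketched as keeping a random-length prefix of the latent tokens during training, so that the earliest tokens (together with the semantic condition) must carry most of the information. The abstract does not give the sampling schedule, so the uniform sampling, the `min_keep` floor, and the name `drop_tail_tokens` are assumptions for illustration only.

```python
import random


def drop_tail_tokens(tokens, min_keep=1, rng=random):
    """Keep a random-length prefix of the token sequence and drop the tail."""
    keep = rng.randint(min_keep, len(tokens))  # inclusive on both ends
    return tokens[:keep]


rng = random.Random(0)             # seeded for reproducibility
latent_tokens = list(range(32))    # stand-in for 32 latent tokens
prefix = drop_tail_tokens(latent_tokens, min_keep=4, rng=rng)
```

Because only the tail is ever removed, any prefix length remains a valid (if coarser) encoding, which is what allows generation under a compact token budget.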

213. ❌ A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Authors: Jiaming Liang, Chi-Man Pun Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

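Scoring rationale: The paper studies spatial alignment in adversarial attacks, focusing on semantic segmentation and object detection in computer vision. It covers computer vision security topics such as adversarial attacks, spatial transformations, and label alignment, which are unrelated to all scoring keywords (these center on large models, core deep learning techniques, and AI for Science). The paper does not mention large models, language models, training techniques, inference optimization, AI agents, or AI for Science applications.

!!! tip deepseek-chat TL;DR

Existing transformation-based adversarial attacks perform poorly on structured tasks (such as semantic segmentation and object detection) because of spatial misalignment. The paper proposes a unified Spatial Alignment Framework (SAF) that transforms inputs and labels synchronously to substantially improve attack transferability. In non-targeted attacks it lowers victim model metrics on several datasets, for example the average mIoU on Cityscapes from 24.50 to 11.34.

Abstract (Chinese translation)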

基于变换的对抗攻击（Transformation-based Adversarial Attacks, TAAs）在欺骗分类模型时展现出强大的可迁移性。然而，现有TAAs在应用于语义分割和目标检测等结构化任务时，往往表现不佳甚至失效。值得关注的是，近期研究将变换划分为非空间变换与空间变换，这为我们应对该挑战提供了启示。我们发现，对于非结构化任务，其标签在空间上是非结构化的，因此TAAs在应用空间变换时无需调整标签。相反，对于结构化任务，标签具有空间结构，若未能使标签与输入同步变换，则会导致空间错位并产生错误梯度。为解决这些问题，我们提出了一种新颖的统一空间对齐框架（Spatial Alignment Framework, SAF），用于在空间结构化任务上实现高可迁移性的TAAs。该框架通过提出的空间对齐（Spatial Alignment, SA）算法，使TAAs能够利用空间变换同步处理输入与标签。大量实验证明了我们的SAF在结构化任务的TAAs中具有关键作用。具体而言，在非定向攻击中，我们的SAF将Cityscapes数据集的平均mIoU从24.50降低至11.34，将Kvasir-SEG数据集的平均mIoU从49.91降低至31.80，同时将COCO数据集的平均mAP从17.89降低至5.25。

Abstract

Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.

Keywords: adversarial attacks, spatial alignment, structured tasks, transformation-based attacks, semantic segmentation, object detection, transferability, spatial transformations
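
The misalignment problem is easy to reproduce on a toy segmentation example. The sketch below uses a horizontal flip as the spatial transformation and counts how many label pixels still sit on the object. It illustrates the general principle only, not the paper's Spatial Alignment algorithm; the helper names `hflip` and `overlap` are hypothetical.

```python
def hflip(grid):
    """Horizontally flip a 2D grid (an image or a label mask)."""
    return [row[::-1] for row in grid]


def overlap(image, mask):
    """Count pixels where a positive label sits on a non-zero image pixel."""
    return sum(1
               for img_row, mask_row in zip(image, mask)
               for pixel, label in zip(img_row, mask_row)
               if pixel and label)


image = [[9, 9, 0, 0],
         [9, 9, 0, 0]]   # object occupies the left half
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0]]    # per-pixel labels for the same object

misaligned = overlap(hflip(image), mask)       # input transformed, label not
aligned = overlap(hflip(image), hflip(mask))   # both transformed together
```

With the label left untouched, none of the labeled pixels cover the flipped object, so any loss computed against it would produce misleading gradients. Transforming both restores full overlap.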

214. ❌ An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models

Authors: Sazzad Hossain, Saiful Islam, Muhammad Ibrahim, Md. Rasel Ahmed, Md Shuayb, Ahmedul Kabir Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

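Scoring rationale: The paper applies conventional machine learning and deep learning models to skin disease image classification, which is an application of AI in healthcare. It does not involve large language models (LLMs), architectural innovations (such as MoE), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or similar keyword topics. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to biomedicine (dermatology). This is not a core innovation, so it receives a relevance of 5 (moderately related). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The study builds a publicly available image dataset covering five common skin diseases in Bangladesh and applies several machine learning and deep learning models to classify them, providing benchmark results for automated skin disease diagnosis.

Abstract (Chinese translation)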

皮肤病是全球范围内一项重大的公共卫生问题，在没有皮肤科专业知识支持的情况下，其检测往往面临挑战。在孟加拉国这样人口稠密的国家，合格的皮肤科专家和诊断仪器数量不足以满足需求。由于缺乏适当的皮肤病检测与治疗，可能导致包括死亡在内的严重健康后果。皮肤病的常见特征是改变皮肤的颜色、质地和纹理，在当今人工智能和机器学习的时代，我们能够利用图像处理和计算机视觉技术来检测皮肤病。为应对这一挑战，我们开发了一个公开可用的数据集，专注于利用机器学习技术检测常见皮肤病。我们聚焦于孟加拉国五种流行皮肤病：接触性皮炎（Contact Dermatitis）、白癜风（Vitiligo）、湿疹（Eczema）、疥疮（Scabies）和体癣（Tinea Ringworm）。该数据集包含1612张图像（其中250张为独立采集图像，其余为增强图像），均直接采集自孟加拉国法里德布尔医学院门诊部的患者。数据分别包含皮炎302张、湿疹381张、疥疮301张、体癣316张和白癜风312张图像。尽管数据是在特定区域收集的，但所选疾病在许多国家尤其是南亚地区普遍存在，这使得该数据集对于基于机器学习的皮肤病学全球应用具有潜在价值。我们还在该数据集上应用了多种机器学习和深度学习模型，并报告了分类性能。我们期望这项研究能够引起从事自动化疾病诊断领域的机器学习与深度学习研究人员及实践者的关注。

Abstract

Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. Due to the lack of proper detection and treatment of skin diseases, that may lead to severe health consequences including death. Common properties of skin diseases are, changing the color, texture, and pattern of skin and in this era of artificial intelligence and machine learning, we are able to detect skin diseases by using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which, 250 are distinct while others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprises of 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models on the dataset and report classification performance. We expect that this research would garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.

Keywords: skin disease detection, image dataset, machine learning, deep learning, dermatology, Bangladesh, classification, computer vision
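
The per-class counts quoted in the abstract can be checked against the stated total with a few lines of arithmetic. The figures below come directly from the abstract; the variable names are illustrative.

```python
class_counts = {
    "Dermatitis": 302,
    "Eczema": 381,
    "Scabies": 301,
    "Tinea Ringworm": 316,
    "Vitiligo": 312,
}

total = sum(class_counts.values())   # should match the reported 1612 images
distinct = 250                       # images reported as distinct
augmented = total - distinct         # the rest are augmented copies

# Share of each class in percent, to see how balanced the dataset is.
shares = {name: round(100 * count / total, 1)
          for name, count in class_counts.items()}
largest_class = max(class_counts, key=class_counts.get)
```

The counts add up to the reported total, and the augmented images outnumber the distinct ones by more than five to one, which is worth keeping in mind when reading the benchmark accuracy.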

215. ❌ SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment

Authors: Pengyu Chen, Haotian Sa, Yiwei Hu, Yuhan Cheng, Junbo Wang Journal/Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

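Scoring rationale: The paper focuses on small-target detection in computer vision and proposes SDD-YOLO, an improved YOLO-based framework for ground-to-air UAV detection. It covers detection algorithm optimization, the design of a high-resolution detection head, training strategy improvements (MuSGD, ProgLoss, STAL), and an evaluation of inference efficiency for edge deployment. All scoring keywords relate directly to large language models (LLMs), innovations in core deep learning techniques, or AI for Science, whereas this paper addresses a conventional object detection problem. It does not involve large-model topics such as language models, MoE, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, inference optimization, chain of thought, agents, quantization, hallucination mitigation, interpretability, world models, model merging, or in-context learning, nor AI for Science applications such as bioinformatics or cheminformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the challenges of detecting small UAVs from the ground, the paper proposes the SDD-YOLO framework, which adds a P2 high-resolution detection head and an optimized training strategy. It reaches 86.0% mAP@0.5 on the self-built DroneSOD-30K dataset and runs at 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU.

Abstract (Chinese translation)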

从地对空（G2A）视角检测小型无人机（UAV）面临重大挑战，包括极低的像素占比、复杂的空中背景以及严格的实时性要求。现有的基于YOLO的检测器主要针对通用目标检测进行优化，对于亚像素目标往往缺乏足够的特征分辨率，同时在部署时引入了复杂性。本文提出SDD-YOLO，一个专为G2A反无人机监控设计的小目标检测框架。为捕捉对微目标至关重要的细粒度空间细节，SDD-YOLO引入了一个在4倍下采样下运行的P2高分辨率检测头。此外，我们整合了YOLO26的最新架构进展，包括采用无需DFL（Distribution Focal Loss）、无需NMS（非极大值抑制）的架构以简化推理流程，以及结合了ProgLoss和STAL的MuSGD混合训练策略，该策略显著缓解了稀疏小目标信号上的梯度振荡。为支持评估，我们构建了DroneSOD-30K，一个大规模G2A数据集，包含约30,000张标注图像，涵盖多种气象条件。实验表明，SDD-YOLO-n在DroneSOD-30K上实现了86.0%的mAP@0.5，比YOLOv5n基线高出7.8个百分点。广泛的推理分析显示，我们的模型在NVIDIA RTX 5090上达到226 FPS，在Intel Xeon CPU上达到35 FPS，展现了未来边缘部署的卓越效率。

Abstract

Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.

Keywords: small-target detection, ground-to-air anti-UAV surveillance, SDD-YOLO, P2 high-resolution detection head, edge-efficient deployment, DroneSOD-30K dataset, YOLO-based detectors, real-time constraints
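
A quick calculation shows why a detection head at 4x downsampling helps with tiny targets. The sketch compares the feature-grid size and the number of grid cells an object spans at the P2 stride named in the abstract and at the strides 8, 16, and 32 conventionally used by YOLO heads P3 to P5. The 640-pixel input and the 8-pixel target are assumptions chosen for illustration, not values from the paper.

```python
def grid_size(input_size, stride):
    """Side length of the feature grid at a given downsampling stride."""
    return input_size // stride


def cells_spanned(object_px, stride):
    """How many grid cells an object of `object_px` pixels covers per side."""
    return object_px / stride


INPUT_SIZE = 640   # assumed square input resolution
OBJECT_PX = 8      # assumed side length of a small UAV in pixels

levels = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
summary = {name: (grid_size(INPUT_SIZE, stride), cells_spanned(OBJECT_PX, stride))
           for name, stride in levels.items()}
```

Under these assumptions the P2 grid is 160 x 160 and the target still spans two cells per side, whereas at P3 it shrinks to a single cell and at deeper levels to a fraction of one.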

216. ❌ TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

Authors: Peng Wen, Yuting Wang, Qiurui Wang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25199v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

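Scoring rationale: The paper "TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation" focuses on a dataset and benchmark for football tactical style imitation, and its content spans computer vision, behavior imitation, and sports analytics. It does not mention large models, the technical principles of deep learning, or any specific AI for Science application, so none of the keywords match its topic and every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper introduces TacSIm, a dataset and benchmark for football tactical style imitation. It projects player positions and actions from Premier League broadcast footage onto a standard pitch coordinate system and evaluates spatial occupancy similarity and movement vector similarity, giving both quantitative and visual assessment of tactical coordination.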

摘要翻译

当前足球模仿研究主要致力于优化基于奖励的目标，如进球数或胜率指标，较少关注准确复现真实世界球队战术行为。我们推出TacSIm（战术风格模仿数据集），这是一个用于足球战术风格模仿的大规模数据集与基准测试平台。TacSIm在英超比赛的单一直播镜头下，模仿给定直播画面中某一方全部11名球员的动作。在进攻或防守直播画面中，TacSIm将双方共22名球员的起始位置与动作映射至标准球场坐标系。该数据集提供明确的风格模仿任务与评估框架，通过定义时段内的空间占据相似度与移动向量相似度来衡量战术风格模仿效果，支持对单支球队空间与时间相似性的评估。我们在统一虚拟环境中运行多种基线方法以生成全队行为，实现对战术协调性的定量与可视化评估。通过采用从直播到仿真的统一数据与度量标准，TacSIm为足球领域风格对齐的战术模仿任务的衡量与建模建立了严谨的基准。

摘要 (Abstract)

Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under an offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation task in football.

Keywords: football tactical imitation, dataset, benchmark, spatial occupancy similarity, movement vector similarity, tactical coordination, Premier League, virtual environment
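The abstract names two metrics, spatial occupancy similarity and movement vector similarity, without giving their formulas. The sketch below shows one plausible reading, purely as an assumption: occupancy is compared as the cosine similarity of coarse pitch-grid histograms, and movement as the mean cosine similarity of paired per-player displacement vectors. The pitch dimensions, the grid resolution, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical definitions for illustration; the TacSIm paper's exact
# metric formulas are not given in the abstract.
PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0  # assumed pitch size in metres


def occupancy_histogram(positions, bins_x=6, bins_y=4):
    """Count how often players occupy each cell of a coarse pitch grid."""
    hist = Counter()
    for x, y in positions:
        col = max(0, min(int(x / PITCH_LENGTH * bins_x), bins_x - 1))
        row = max(0, min(int(y / PITCH_WIDTH * bins_y), bins_y - 1))
        hist[(col, row)] += 1
    return hist


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(a * a for a in u.values()))
    norm_v = math.sqrt(sum(b * b for b in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def occupancy_similarity(real_positions, sim_positions):
    """Cosine similarity between the two occupancy histograms."""
    return cosine(occupancy_histogram(real_positions), occupancy_histogram(sim_positions))


def movement_similarity(real_moves, sim_moves):
    """Mean cosine similarity between paired per-player displacement vectors."""
    scores = [
        cosine({"x": rx, "y": ry}, {"x": sx, "y": sy})
        for (rx, ry), (sx, sy) in zip(real_moves, sim_moves)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Both functions return values near 1 when the simulated team occupies the same zones and moves in the same directions as the real team; a benchmark would evaluate them over a defined time window, as the abstract describes.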

217. ❌ AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang, Caixia Yan, Bing Deng, Jieping Ye Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

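Scoring rationale: The paper focuses on video generation, specifically AnyID, an identity-preserving video generation framework. Although it applies deep learning and large models to a generative task, its content is unrelated to the vast majority of the keywords. The only relevant keyword is "RLHF OR RLAIF OR Direct Preference Optimization OR DPO": the paper explicitly states that it uses reinforcement learning for the final fine-tuning stage and relies on a preference dataset built from human evaluations, which corresponds directly to RLHF (Reinforcement Learning from Human Feedback). Other keywords, such as large models, MoE, quantization, inference acceleration, and AI for Science, are not covered.

!!! tip deepseek-chat TL;DR

The paper addresses the limitation that existing identity-preserving video generation methods usually target a single identity reference. It proposes the AnyID framework, which uses a scalable omni-referenced architecture and a primary-referenced generation paradigm to achieve ultra-high-fidelity identity preservation and precise attribute-level controllability.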

摘要翻译

身份保持视频生成为创意表达提供了强大工具，使用户能够定制包含其喜爱角色的视频。然而，主流方法通常针对单一身份参考进行设计和优化。这一基本假设限制了创作灵活性，因其未能充分适应多样化的现实世界输入格式。依赖单一来源也构成了一个不适定场景，导致固有的模糊性设置，使模型难以在新颖情境中忠实地复现身份。为解决这些问题，我们提出了AnyID，一个具备超高保真度的身份保持视频生成框架，其核心贡献包含两点。首先，我们引入了一种可扩展的全参考架构，能够将异构身份输入（如面部图像、肖像和视频）有效统一为连贯的表征。其次，我们提出了一种主参考生成范式，该范式指定一个参考作为规范锚点，并采用新颖的差分提示来实现精确的属性级可控性。我们在一个大规模精心策划的数据集上进行训练以确保鲁棒性和高保真度，随后利用强化学习进行最终微调阶段。此过程利用了基于人类评估构建的偏好数据集，其中标注者根据两个关键标准对视频进行成对比较：身份保真度与提示可控性。大量评估验证了AnyID在不同任务设置下均实现了超高身份保真度以及卓越的属性级可控性。

摘要 (Abstract)

Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.

Keywords: identity-preserving video generation, AnyID, omni-referenced architecture, primary-referenced generation, differential prompt, reinforcement learning fine-tuning, human preference dataset, attribute-level controllability

218. ❌ CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

Authors: Marvin Seyfarth, Sarah Kaye Müller, Arman Ghanaat, Isabelle Ayx, Fabian Fastenrath, Philipp Wild, Alexander Hertel, Theano Papavassiliu, Salman Ul Hassan Dar, Sandy Engelhardt Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

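Scoring rationale: CardioDiT focuses on medical image generation for 4D cardiac MRI synthesis using latent diffusion models and diffusion transformers, an application of AI to biomedical imaging. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the paper applies AI directly to medical imaging (cardiac MRI), which falls under bioinformatics and AI for Science. The other keywords all concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper studies diffusion models for image generation and involves no language models or related techniques, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a unified 4D latent diffusion transformer (CardioDiT) can synthesize spatiotemporally consistent cardiac MRI, avoiding the physiological inconsistencies that space-time factorization causes in earlier methods. Experiments show improved inter-slice consistency and realistic cardiac function distributions.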

摘要翻译

潜在扩散模型（LDMs）近期在三维医学图像合成中展现出优异性能。然而，如电影心脏磁共振成像（cine CMR）这类模态，代表了整个心动周期中时间同步的三维容积，其增加了一个额外的维度，而大多数生成方法并未直接建模该维度。这些方法通常将空间与时间分解处理，或通过解剖掩模等辅助机制强制实现时间一致性。此类策略引入了结构性偏差，可能限制全局上下文整合，并导致细微的时空不连续性或生理学上不一致的心脏动力学。本研究探讨了统一的四维生成模型是否能在无需架构分解的情况下学习连续的心脏动力学。我们提出了CardioDiT——一个基于扩散变换器的完整四维潜在扩散框架，用于短轴电影心脏磁共振合成。该框架通过时空VQ-VAE将二维+时间切片编码为紧凑的潜在表示，随后由扩散变换器将其作为完整的三维+时间容积进行联合建模，在整个生成过程中耦合空间与时间维度。我们在公开心脏磁共振数据集及更大规模的私有队列上评估CardioDiT，并将其与时空耦合程度逐步增强的基线方法进行比较。实验结果表明，该方法提升了切片间一致性、时间连贯的运动模式以及真实的心脏功能分布，这证明采用扩散变换器进行显式四维建模为时空心脏图像合成提供了理论严谨的基础。基于公开数据训练的代码与模型已发布于https://github.com/Cardio-AI/cardiodit。

摘要 (Abstract)

Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at https://github.com/Cardio-AI/cardiodit.

Keywords: Cardiac MRI synthesis, Latent diffusion models, Diffusion transformers, 4D medical imaging, Spatiotemporal modeling, Cardiac dynamics, VQ-VAE, Generative AI

219. ❌ VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, Sandy Engelhardt Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25181v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

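Scoring rationale: The paper focuses on diffusion models for medical image synthesis and is entirely unrelated to the vast majority of the large language model (LLM) keywords (LLMs, MoE, SFT, RLHF, RAG, CoT, Agents, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper studies 3D medical image synthesis, an application of AI in biomedicine that is highly relevant to "AI for Science", so it scores 10. All other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes VolDiT, described as the first purely transformer-based 3D diffusion model for controllable volumetric medical image synthesis, and reports better global coherence and generative fidelity than existing U-Net-based methods.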

摘要翻译

扩散模型已成为实现高保真医学图像合成的主流方法。然而，现有的大多数三维医学图像生成方法依赖于潜在扩散框架中的卷积U-Net主干网络。尽管这些架构有效，但它们引入了强烈的局部性偏置和有限的感受野，这可能制约模型的可扩展性、全局上下文整合能力以及灵活的条件控制。在本研究中，我们提出了VolDiT，这是首个完全基于Transformer的三维扩散Transformer模型，用于体医学图像合成。我们的方法通过体素块嵌入和直接在三维标记上进行全局自注意力操作，将扩散Transformer扩展至原生三维数据。为实现结构化控制，我们提出了一种时间步门控控制适配器，该适配器将分割掩码映射为可学习的控制标记，并在去噪过程中调节Transformer层。这种标记级条件机制能够在保留Transformer架构建模优势的同时，提供精确的空间引导。我们在高分辨率三维医学图像合成任务上评估了所提出的模型，并将其与基于U-Net的先进三维潜在扩散模型进行了比较。实验结果表明，我们的模型在全局一致性、生成保真度和可控性方面均表现更优。我们的研究结果表明，完全基于Transformer的扩散模型为体医学图像合成提供了一个灵活的基础框架。基于公开数据训练的代码和模型可在https://github.com/Cardio-AI/voldit获取。

摘要 (Abstract)

Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.

Keywords: Volumetric Medical Image Synthesis, Diffusion Transformers, 3D Diffusion Model, Transformer-based, Controllable Synthesis, Segmentation Mask Conditioning, Global Self-attention, Medical AI
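As a rough illustration of what volumetric patch embedding with global self-attention implies for sequence length, the sketch below counts the tokens produced by cutting a 3D volume into non-overlapping cubic patches, and the number of token pairs one global attention layer compares. The volume and patch sizes are assumptions chosen for illustration, not VolDiT's configuration, and the function names are hypothetical.

```python
# Illustrative arithmetic: volume and patch sizes are assumptions,
# not the configuration used by VolDiT.
def num_tokens(volume_shape, patch_size):
    """Number of non-overlapping cubic patches (tokens) in a 3D volume."""
    if any(dim % patch_size for dim in volume_shape):
        raise ValueError("each dimension must be divisible by the patch size")
    depth, height, width = volume_shape
    return (depth // patch_size) * (height // patch_size) * (width // patch_size)


def attention_pairs(tokens):
    """Token pairs compared by one global self-attention layer."""
    return tokens * tokens


for patch in (8, 4):
    n = num_tokens((64, 64, 64), patch)
    print(patch, n, attention_pairs(n))
```

Halving the patch size multiplies the token count by 8 and the attention pairs by 64, which is why patch size is a central trade-off between spatial detail and compute in any 3D transformer.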

220. ❌ ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Authors: Xike Zhang, Maoyuan Ye, Juhua Liu, Bo Du Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

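Scoring rationale: ET-SAM addresses scene text detection and layout analysis in computer vision, building on the Segment Anything Model (SAM). Although SAM is a vision foundation model, the paper's content is entirely unrelated to the scoring keywords, which mainly target large language models (LLMs) and related techniques. It does not involve LLM techniques, training methods, inference optimization, alignment, agent systems, or AI for Science applications, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

ET-SAM targets the high inference latency and low data utilization of SAM-based scene text detection and layout analysis. It proposes an efficient two-decoder framework that delivers about 3× faster inference and competitive or improved performance on several datasets.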

摘要翻译

先前基于Segment Anything Model（SAM）的研究在统一场景文本检测与版面分析任务中已展现出良好性能。然而，典型方法依赖像素级文本分割来采样数千个前景点作为提示，这导致推理延迟较高且数据利用率有限。为解决上述问题，我们提出ET-SAM——一种基于SAM的高效双解码器框架，用于统一场景文本检测与版面分析。技术上，我们定制了一个轻量级点解码器，通过生成词语热力图来获取少量前景点，从而消除冗余点提示并加速推理。在摆脱对像素级分割依赖的基础上，我们进一步设计了联合训练策略，以利用现有具有异构文本级标注的数据。具体而言，将包含多层级、仅词语级及仅行级标注的数据集并行整合为统一训练集。针对这些数据集，我们在点解码器与分层掩码解码器中引入了三组可学习的任务提示，以缓解数据集间的差异。大量实验表明，相较于先前基于SAM的架构，ET-SAM在HierText数据集上实现了约3倍的推理加速且保持竞争力性能，并在Total-Text、CTW1500和ICDAR15数据集上平均提升了11.0%的F分数。

摘要 (Abstract)

Previous works based on Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3× inference acceleration while obtaining competitive performance on HierText, and improves an average of 11.0% F-score on Total-Text, CTW1500, and ICDAR15.

Keywords: ET-SAM, Segment Anything Model, scene text detection, layout analysis, point prompt prediction, inference acceleration, joint training strategy, hierarchical mask decoder
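The abstract says the point decoder produces word heatmaps from which a few foreground points are obtained, but it does not describe how the points are selected. The sketch below shows one common way to do this, offered only as an assumption: keep the strict local maxima of the heatmap that exceed a threshold. The function name, the 3×3 neighbourhood, and the threshold are hypothetical.

```python
# Hypothetical post-processing sketch; ET-SAM's actual point decoder and
# point-selection rule are not described in the abstract.
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) of strict local maxima scoring at least threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the up-to-8 neighbours inside the heatmap bounds.
            neighbours = [
                heatmap[nr][nc]
                for nr in range(max(r - 1, 0), min(r + 2, rows))
                for nc in range(max(c - 1, 0), min(c + 2, cols))
                if (nr, nc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


word_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
print(heatmap_peaks(word_heatmap))
```

Each returned coordinate would serve as a single point prompt, so the number of prompts scales with the number of words rather than with the number of foreground pixels.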

221. ❌ AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Authors: Md Mushfiqur Azam, John Quarles, Kevin Desai Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25175v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

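Scoring rationale: AG-EgoPose addresses egocentric 3D human pose estimation in computer vision, using a dual-stream framework (a spatial stream and a temporal stream) that combines ResNet and Transformer architectures to process fisheye camera input. The scoring keywords all concern large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, quantization, and so on) or applications of AI in science (such as bioinformatics). The paper does not involve language models, model training techniques, reasoning methods, agent systems, or AI for Science applications, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes AG-EgoPose, a dual-stream framework that combines short- and long-range motion context with fine-grained spatial cues to handle the severe perspective distortion, limited body visibility, and complex camera motion of egocentric 3D human pose estimation. It reports state-of-the-art performance on real-world datasets.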

摘要翻译

以自我为中心的三维人体姿态估计仍面临严峻挑战，这主要源于第一人称视角固有的严重透视畸变、有限的身体可见度以及复杂的相机运动。现有方法通常依赖单帧分析或有限的时序融合，未能有效利用自我中心视频中丰富的运动上下文信息。我们提出AG-EgoPose——一种新颖的双流框架，该框架将短程与长程运动上下文信息与细粒度空间线索相结合，以实现从鱼眼相机输入中进行鲁棒的姿态估计。我们的框架包含两个并行流：空间流采用权重共享的ResNet-18编码器-解码器生成二维关节热图及相应的关节特定空间特征标记；与此同时，时序流使用ResNet-50骨干网络提取视觉特征，再通过动作识别骨干网络处理以捕捉运动动态。这些互补的表征在带有可学习关节标记的Transformer解码器中进行融合与优化，从而在保持解剖学约束的同时，实现了空间与时序证据在关节层面的整合。在真实世界数据集上的实验表明，AG-EgoPose在定量与定性指标上均达到了最先进的性能。代码发布于：https://github.com/Mushfiq5647/AG-EgoPose。

摘要 (Abstract)

Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.

Keywords: Egocentric 3D Pose Estimation, Dual-stream Framework, Action-Guided Motion, Kinematic Joint Encoding, Fisheye Camera, Transformer Decoder, State-of-the-art Performance

222. ❌ Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Authors: Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

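Scoring rationale: The paper focuses on PointINS, a self-supervised learning framework for 3D point clouds that aims at instance-aware representation learning to advance 3D foundation models. Keyword relevance breaks down as follows: 1. "Large Language Models OR LLMs OR Foundation Models" is highly relevant (8 points), because the paper explicitly targets "Foundation Models for 3D Scene Understanding" and discusses building 3D foundation models; 2. "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat relevant (5 points), because 3D scene understanding can be seen as an application of AI in scientific fields (such as computer vision and robotics), though not in bioinformatics or cheminformatics; 3. the other keywords (MoE, SFT, RAG, and so on) are not covered and score 0. The weighted total is (8.0 × 1.0) + (5.0 × 1.0) = 13.0.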
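The weighted total quoted in this rationale (13.0) and the passing threshold from the report's configuration (5.0 + 0.8 × total weight) can be reproduced with a few lines. This is a minimal sketch: the function names and the shortened keyword labels are hypothetical, and it assumes the total is simply the sum of relevance × weight, as the rationale's arithmetic suggests.

```python
# Sketch of the report's scoring arithmetic. Function names and the
# shortened keyword labels are hypothetical; the formulas follow the
# report's configuration section and the rationale's own calculation.
def weighted_total(relevance, weights):
    """Sum of relevance x weight over all scored keywords."""
    return sum(score * weights.get(keyword, 0.0) for keyword, score in relevance.items())


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold: base + factor x total keyword weight."""
    return base + factor * total_weight


relevance = {"Foundation Models": 8.0, "AI for Science": 5.0}
weights = {"Foundation Models": 1.0, "AI for Science": 1.0}
total = weighted_total(relevance, weights)
threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0
print(total, round(threshold, 1), total >= threshold)
```

With 27 keywords of weight 1.0 each, the threshold works out to 26.6, so a weighted total of 13.0 falls short of a pass.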

!!! tip deepseek-chat TL;DR

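The paper targets the weak instance awareness of 3D point cloud representations. It proposes PointINS, a self-supervised framework that uses geometry-aware learning and two regularization strategies, improving indoor instance segmentation by 3.5% mAP and outdoor panoptic segmentation by 4.1% PQ on average.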

摘要翻译

点云自监督学习的最新进展显著提升了无需人工标注的三维场景理解能力。现有方法主要通过增强视图间的特征一致性或掩码场景建模来强调语义感知，但所得表征在实例定位任务中迁移效果不佳，且通常需要完整微调才能获得强劲性能。实例感知是三维感知的基本组成部分，因此弥合这一差距对于推进真正支持所有三维数据下游任务的基础模型至关重要。本研究提出PointINS——一种面向实例的自监督框架，通过几何感知学习增强点云表征。PointINS采用正交偏移分支联合学习高层语义理解与几何推理，从而获得实例感知能力。我们识别出对鲁棒实例定位至关重要的两个一致性属性，并将其构建为互补的正则化策略：偏移分布正则化（ODR）通过将预测偏移量与经验观测的几何先验对齐，以及空间聚类正则化（SCR）通过伪实例掩码正则化偏移量以增强局部一致性。在五个数据集上的大量实验表明，PointINS在室内实例分割任务中平均提升3.5% mAP，在室外全景分割任务中平均提升4.1% PQ，为可扩展的三维基础模型的发展开辟了新路径。

摘要 (Abstract)

Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

Keywords: 3D scene understanding, self-supervised learning, point clouds, instance awareness, foundation models, geometric reasoning, instance segmentation, panoptic segmentation

223. ❌ SportSkills: Physical Skill Learning from Sports Instructional Videos

Authors: Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

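Scoring rationale: The paper focuses on building a video dataset for sports skill learning and on a retrieval task. It belongs to computer vision and video understanding and does not mention or apply large language models, innovations in deep learning fundamentals, or specific AI for Science techniques. All keywords concern large-model technology, training methods, inference optimization, alignment, or AI for Science applications, whereas the core of this paper is a video dataset and a retrieval system, so none of them is directly relevant.

!!! tip deepseek-chat TL;DR

The paper introduces SportSkills, the first large-scale instructional video dataset for physical skill learning in sports, and builds on it a mistake-conditioned video retrieval system that significantly improves the ability of video models to provide personalized visual instruction for a user query.

Abstract translation (Chinese)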

当前大规模视频数据集主要关注通用人类活动，但缺乏针对物理技能学习所需的细粒度活动覆盖深度。我们推出SportSkills，即首个面向物理技能学习的大规模野外运动视频数据集。该数据集包含超过36万条教学视频，涵盖55种不同运动项目，提供超过63万次视觉演示，并配有解说旁白阐释动作背后的技术要领。通过一系列实验，我们证明SportSkills能够解锁对物理动作间细粒度差异的识别能力。与使用相同模型在传统以活动为中心的数据集上训练相比，我们的表示取得了最高4倍的提升。更重要的是，基于SportSkills，我们首次构建了大规模的“错误条件教学视频检索”任务框架，将表示学习与可操作反馈生成相连接（例如：“这是我的技能执行视频；应该观看哪个教学片段来改进？”）。专业教练的正式评估表明，我们的检索方法显著提升了视频模型针对用户查询个性化匹配视觉教学内容的能力。

Abstract

Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., “here’s my execution of a skill; which video clip should I watch to improve it?”). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

Keywords: sports instructional videos, physical skill learning, large-scale dataset, video retrieval, mistake-conditioned retrieval, representation learning, actionable feedback, fine-grained action understanding

224. ❌ A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

Authors: SuYeon Kim, Wongyu Lee, MyeongAh Cho Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

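Scoring rationale: The paper focuses on anomaly detection in 3D point clouds and proposes a semantically disentangled unified model to address inter-category feature entanglement. It is unrelated to the vast majority of keywords (large-model fundamentals, training methods, inference optimization, alignment, agent systems, and so on), which mainly target large language models in natural language processing. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics": 3D anomaly detection can be viewed as computer vision applied to industrial inspection or scientific use, that is, AI applied to a specific domain (science or engineering), so it receives 5 points (some relevance). The paper involves none of the listed large-model techniques or innovations in deep learning fundamentals, and it has no direct application in bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper addresses a problem in multi-category 3D anomaly detection: feature entanglement between categories leads the model to adopt incorrect semantic priors. It proposes a semantically disentangled unified model that achieves state-of-the-art performance on Real3D-AD and Anomaly-ShapeNet, improving object-level AUROC by 2.8% and 9.1%, respectively.

Abstract translation (Chinese)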

三维异常检测旨在仅利用正常数据训练，实现对三维点云中缺陷的检测与定位。虽然统一模型通过跨类别学习提升了可扩展性，但其常受类别间纠缠问题的影响——即不同类别的潜在特征相互重叠，导致模型在重建过程中采用错误的语义先验，最终产生不可靠的异常分数。为解决该问题，我们提出了一种语义解缠的统一三维异常检测模型，该模型基于解缠后的语义表征进行条件化特征重建。我们的框架包含三个核心组件：（一）用于构建实例级语义身份的由粗到细全局标记化模块，（二）用于解缠类别语义的类别条件对比学习模块，以及（三）实现语义一致性重建的几何引导解码器。在Real3D-AD和Anomaly-ShapeNet数据集上的大量实验表明，我们的方法在统一模型与类别专用模型上均达到了最优性能，分别将物体级AUROC提升了2.8%和9.1%，同时增强了统一三维异常检测的可靠性。

Abstract

3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)-where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

Keywords: 3D anomaly detection, semantic disentanglement, unified model, inter-category entanglement, point cloud, contrastive learning, geometry-guided decoder, state-of-the-art

225. ❌ Learning to Rank Caption Chains for Video-Text Alignment

Authors: Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25145v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

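Scoring rationale: The paper studies improvements to Direct Preference Optimization (DPO) for video-text alignment. Its core contribution is a ranking-based optimization that replaces binary DPO in order to better assess the visual faithfulness of responses from vision-language models. It is therefore highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points) and strongly related to "Instruction Tuning OR Alignment OR Value Alignment" and "Hallucination Mitigation OR Factuality OR Truthfulness" (8 points each), since it deals with model alignment and with factuality and faithfulness. It has some relevance to "Large Language Models OR LLMs OR Foundation Models" and "Post-training OR Supervised Fine-tuning OR SFT" (5 points each), since it involves language models and fine-tuning. The other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Targeting the limitations of Direct Preference Optimization (DPO) for vision-language models in video-text alignment, the paper proposes a ranking-based optimization that generates totally ordered caption chains to assess the visual faithfulness of responses more precisely. Experiments show that it outperforms binary DPO for long-form content generation and assessment, and that the vision encoder must be fine-tuned for these methods to be effective.

Abstract translation (Chinese)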

直接偏好优化（DPO）是一种训练语言模型生成偏好回复而非非偏好回复的有效技术。然而，这种二元化的“赢者通吃”方法对于视觉-语言模型而言并非最优，因为其回复质量高度依赖于视觉内容。具体而言，即使某个回复相较于另一回复的偏好程度较低，它仍可能忠实于视觉输入。标准的布拉德利-特里（Bradley-Terry）DPO公式缺乏这种细微区分，它在提升获胜回复权重的过程中，未能充分考虑“落败”回复是否仍保持较高的视觉忠实度。在本研究中，我们探讨了排序优化作为一种替代方案，它能更精确地评估回复对视觉输入的忠实程度。我们聚焦于使用详细视频描述进行视频-文本对齐，提出了一种通过重复描述降级来大规模生成具有挑战性的全序描述链的方法。我们的结果表明，在生成长篇内容及进行评估时，排序优化优于二元DPO；更重要的是，我们发现这些方法需要对视觉编码器进行微调才能有效，这挑战了将DPO单纯视为语言重加权过程的观点。

Abstract

Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the “losing” response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses’ faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
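
The contrast between binary DPO and ranking over a caption chain can be sketched with two small loss functions. The pairwise loss below is the standard Bradley-Terry DPO objective. The listwise loss is a Plackett-Luce ranking loss, used here as one common way to rank an ordered chain; the abstract does not name the paper's exact ranking objective, so treating it as Plackett-Luce is an assumption, as are the function names and the toy log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_pair_loss(reward_win: float, reward_lose: float) -> float:
    """Bradley-Terry DPO loss for a single (winner, loser) pair."""
    return -log_sigmoid(reward_win - reward_lose)


def plackett_luce_loss(rewards_best_first) -> float:
    """Negative log-likelihood of a full ordering under Plackett-Luce.

    `rewards_best_first` holds implicit rewards ordered from the most
    faithful caption to the most degraded one (assumed chain order).
    """
    loss = 0.0
    for k in range(len(rewards_best_first) - 1):
        tail = rewards_best_first[k:]
        peak = max(tail)
        log_norm = peak + math.log(sum(math.exp(r - peak) for r in tail))
        loss -= rewards_best_first[k] - log_norm
    return loss


# Hypothetical (policy log-prob, reference log-prob) pairs for a 3-caption chain.
chain = [(-10.0, -12.0), (-11.0, -11.5), (-14.0, -12.0)]
rewards = [implicit_reward(p, r) for p, r in chain]

pair_loss = dpo_pair_loss(rewards[0], rewards[-1])  # uses only best vs. worst
chain_loss = plackett_luce_loss(rewards)            # uses the whole ordering
```

For a chain of length two, the Plackett-Luce loss reduces exactly to the pairwise DPO loss, so the listwise form can be read as a generalization that also positions the intermediate captions.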

Keywords: Direct Preference Optimization, Video-Text Alignment, Ranking Optimization, Visual Fidelity, Caption Chains, Vision-Language Models, Fine-tuning, Long-form Content Generation

226. ❌ EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Authors: Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

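Scoring rationale: The paper focuses on 6D object pose estimation in computer vision, specifically on robustness in egocentric (first-person) views under extreme conditions. Its core contributions are a new dataset, EgoXtreme, and an evaluation of existing pose estimation methods on it. All scoring keywords concern large models, deep learning fundamentals, or AI applications in science, and the subject matter of this paper (computer vision, pose estimation, dataset construction) has no direct connection to them. The paper does not touch on large-model techniques, training methods, inference optimization, alignment, AI agents, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper presents EgoXtreme, a new dataset for evaluating the robustness of 6D object pose estimation models in egocentric views under extreme conditions (such as motion blur, dynamic illumination, and visual obstructions), and finds that existing methods generalize poorly under these conditions.

Abstract translation (Chinese)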

智能眼镜能在双手忙碌、视线需专注于任务的场景中提供丰富信息，正逐渐成为一种实用设备。为理解佩戴者的情境，以自我中心视角进行6D物体姿态估计变得至关重要。然而，现有的6D物体姿态估计基准数据集未能充分反映真实世界自我中心应用中的挑战，这些场景通常存在严重的运动模糊、动态光照和视觉遮挡。这种差异导致受控实验室数据与混乱的现实应用之间存在显著鸿沟。为弥合这一差距，我们提出了EgoXtreme，一个完全从自我中心视角采集的大规模6D姿态估计数据集。EgoXtreme包含工业维护、体育运动和紧急救援三大挑战性场景，通过极端光照、剧烈运动模糊和烟雾引入严重的感知模糊性。在EgoXtreme上对当前最先进的可泛化姿态估计器进行评估表明，它们的泛化能力在极端条件下（尤其是低光照环境）无法保持。我们进一步证明，简单应用图像恢复技术（如去模糊）在极端条件下并不能带来提升。不过，基于追踪的方法取得了性能提升，这表明在快速运动场景中利用时序信息是有意义的。我们认为，EgoXtreme是开发和评估足以应对真实世界自我中心视觉需求的下一代鲁棒姿态估计模型的关键资源。数据集与代码公开于 https://taegyoun88.github.io/EgoXtreme/ 。

Abstract

Smart glass is emerging as an useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. While performance gain has appeared in tracking-based approach, implying using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/

Keywords: 6D object pose estimation, egocentric view, extreme conditions, dataset, EgoXtreme, motion blur, dynamic illumination, robustness

227. ❌ Robust Principal Component Completion

Authors: Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

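Scoring rationale: The paper "Robust Principal Component Completion" extends robust principal component analysis (RPCA). It proposes a new framework that uses variational Bayesian inference and sparse tensor factorization to handle a sparse foreground occluding a low-rank background. The work belongs to classical machine learning, signal processing, and computer vision, and involves matrix and tensor factorization, anomaly detection, and video processing. All scoring keywords concern large language models, deep learning fundamentals, AI for Science applications, or related techniques (such as MoE, quantization, and inference acceleration), none of which the paper addresses, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new framework, robust principal component completion (RPCC), which uses variational Bayesian inference and sparse tensor factorization to handle a sparse foreground occluding a low-rank background. It achieves near-optimal estimates on synthetic data, and robust foreground extraction and anomaly detection on real color video and hyperspectral datasets, respectively.

Abstract translation (Chinese)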

鲁棒主成分分析（RPCA）旨在从数据总和中分离出低秩成分与稀疏成分。然而，在许多实际应用中，稀疏前景实际上会替换或遮挡低秩背景中的元素。为解决这一不匹配问题，本文提出一种新框架，通过确定稀疏成分的支撑集来间接识别该成分。该方法称为鲁棒主成分补全（Robust Principal Component Completion, RPCC），通过应用于完全概率化贝叶斯稀疏张量分解的变分贝叶斯推断进行求解。研究证明该方法能收敛至支撑集的硬分类器，从而消除了大多数现有RPCA驱动方法所需的后处理阈值化步骤。实验结果表明，所提方法在合成数据上能提供近乎最优的估计，同时在真实彩色视频数据集和高光谱数据集上分别实现了鲁棒的前景提取与异常检测性能。源代码及附录详见 https://github.com/WongYinJ/BCP-RPCC。

Abstract

Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
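
The mismatch described above can be shown on a tiny example. Classical RPCA assumes the observation is a sum, M = L + S. When the foreground occludes the background, the observation is instead L outside a support mask and S on it. The sketch below builds both observations from the same ingredients. The mask-based notation, the helper names, and the 3×3 toy data are assumptions for illustration, and the code does not implement the paper's variational Bayesian solver.

```python
def outer(u, v):
    """Rank-one matrix u v^T, a minimal stand-in for a low-rank background."""
    return [[ui * vj for vj in v] for ui in u]


def additive_observation(background, foreground):
    """Classical RPCA model: M = L + S."""
    return [[l + s for l, s in zip(lrow, srow)]
            for lrow, srow in zip(background, foreground)]


def occluded_observation(background, foreground, support):
    """Replacement model: M = L outside the support, M = S on the support."""
    return [[s if on else l for l, s, on in zip(lrow, srow, mrow)]
            for lrow, srow, mrow in zip(background, foreground, support)]


# Hypothetical toy data: rank-one background, two occluded entries.
background = outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])
support = [[False, False, True],
           [False, True, False],
           [False, False, False]]
foreground = [[9.0 if on else 0.0 for on in row] for row in support]

additive = additive_observation(background, foreground)
occluded = occluded_observation(background, foreground, support)

# Under occlusion, the residual M - L on the support is S - L, not S.
residual_on_support = occluded[0][2] - background[0][2]
```

In the occluded observation the residual at a masked entry is 7.0 rather than the foreground value 9.0, which is why an additive decomposition misestimates the foreground and why recovering the support itself is a natural target.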

Keywords: Robust Principal Component Analysis, Sparse Component, Low-rank Background, Variational Bayesian Inference, Tensor Factorization, Foreground Extraction, Anomaly Detection, Hyperspectral Data

228. ❌ Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

Authors: Yaowen Chang, Zhen Cao, Xu Zheng, Xiaoxin Mi, Zhen Dong Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

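Scoring rationale: The paper focuses on unsupervised domain adaptation (UDA) and source-free UDA (SFUDA) for panoramic semantic segmentation. It belongs to computer vision rather than to large language models or innovations in deep learning fundamentals. Its core is to tackle geometric distortion and annotation cost in panoramic images, and it proposes the DAPASS framework for pseudo-label denoising and contextual alignment. It relates only to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation" (relevance 8.0/10), because it involves domain adaptation, although not the pre-training or continual pre-training of large models. The other keywords, which concern large-model technology, AI for Science applications, and similar topics, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

Targeting domain adaptation for panoramic semantic segmentation when source data is inaccessible, the paper proposes the DAPASS framework, whose denoising and alignment modules achieve state-of-the-art results on outdoor and indoor benchmarks.

Abstract translation (Chinese)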

全景语义分割对于自动驾驶和虚拟现实等关键应用中的360°全方位场景理解至关重要。然而，该领域的发展受到两个关键挑战的制约：全景投影固有的严重几何畸变，以及密集标注的过高成本。虽然利用标注丰富的针孔相机数据集进行无监督域适应（Unsupervised Domain Adaptation, UDA）提供了一种可行的替代方案，但许多现实任务提出了更严格的无源（source-free, SFUDA）约束，即由于隐私或专有原因无法访问源数据。这一约束显著加剧了域偏移（domain shift）这一核心问题，导致生成不可靠的伪标签（pseudo-labels）和性能急剧下降，尤其是对于少数类别（minority classes）。为克服这些局限，我们提出了DAPASS框架。DAPASS引入了两个协同模块，以在无源数据的情况下稳健地迁移知识。首先，我们的全景置信度引导去噪（Panoramic Confidence-Guided Denoising, PCGD）模块通过强制扰动一致性（perturbation consistency）并结合邻域级置信度来过滤噪声，从而生成高保真、类别平衡的伪标签。其次，上下文分辨率对抗模块（Contextual Resolution Adversarial Module, CRAM）通过对抗性地对齐来自高分辨率裁剪图的细粒度细节与来自低分辨率上下文的全局语义，显式地处理尺度变化和畸变问题。DAPASS在户外（Cityscapes-to-DensePASS）和室内（Stanford2D3D）基准测试中均取得了最先进的性能，分别获得了55.04%（+2.05%）和70.38%（+1.54%）的平均交并比（mIoU）。

Abstract

Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.
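
The two filtering signals named for PCGD, perturbation consistency and neighborhood-level confidence, can be illustrated with a small pseudo-label filter. The sketch keeps a pixel's label only if two perturbed predictions agree and the mean confidence in its 3×3 neighborhood passes a threshold. The 3×3 window, the 0.8 threshold, the ignore value -1, and the function name are assumptions. The class-balancing part of PCGD is not modeled, and the actual DAPASS procedure may differ.

```python
def filter_pseudo_labels(labels_a, labels_b, confidence, threshold=0.8, ignore=-1):
    """Keep labels that are perturbation-consistent and locally confident.

    labels_a, labels_b: per-pixel class ids from two perturbed views (2D lists).
    confidence: per-pixel confidence in [0, 1] (2D list of the same shape).
    """
    height, width = len(labels_a), len(labels_a[0])
    kept = [[ignore] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if labels_a[i][j] != labels_b[i][j]:
                continue  # predictions disagree under perturbation
            window = [
                confidence[y][x]
                for y in range(max(0, i - 1), min(height, i + 2))
                for x in range(max(0, j - 1), min(width, j + 2))
            ]
            if sum(window) / len(window) >= threshold:
                kept[i][j] = labels_a[i][j]
    return kept


# Hypothetical 3x3 predictions: one disagreement and a low-confidence bottom row.
labels_a = [[1, 1, 2], [1, 1, 2], [0, 0, 2]]
labels_b = [[1, 1, 2], [1, 2, 2], [0, 0, 2]]
confidence = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9], [0.2, 0.2, 0.9]]

pseudo = filter_pseudo_labels(labels_a, labels_b, confidence)
```

Only the top row survives in this toy case: the center pixel is dropped for inconsistency, and the remaining pixels are dropped because the low-confidence bottom row pulls their neighborhood averages below the threshold.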

Keywords: Panoramic semantic segmentation, Unsupervised Domain Adaptation, Source-Free UDA, Pseudo-label denoising, Contextual alignment, DAPASS framework, Geometric distortion, Domain shift

229. ❌ AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Authors: Minh-Quan Viet Bui, Jaeho Moon, Munchurl Kim Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

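Scoring rationale: The paper focuses on applying 3D Vision Foundation Models (3DVFMs) to pose-free novel view synthesis (NVS). It proposes the AirSplat framework, built on Self-Consistent Pose Alignment and Rating-based Opacity Matching. All scoring keywords concern large language models (LLMs), deep learning fundamentals, or AI applications in science, whereas this paper studies 3D vision and computer graphics and has no direct connection to them. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AirSplat framework, which uses Self-Consistent Pose Alignment and Rating-based Opacity Matching to adapt the geometric priors of 3D vision foundation models to pose-free novel view synthesis, significantly improving reconstruction quality.

Abstract translation (Chinese)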

尽管三维视觉基础模型（3DVFMs）在视觉几何估计中展现出卓越的零样本能力，但其直接应用于可泛化的新视角合成（NVS）仍面临挑战。本文提出AirSplat，一种新颖的训练框架，能够将3DVFMs的鲁棒几何先验有效适配至高保真、无需姿态输入的NVS。我们的方法包含两项关键技术贡献：（1）自一致姿态对齐（Self-Consistent Pose Alignment, SCPA），这是一种训练时反馈循环，通过像素级对齐监督解决姿态与几何不一致问题；（2）基于评分的透明度匹配（Rating-based Opacity Matching, ROM），该方法利用从稀疏视角NVS教师模型中提取的局部三维几何一致性知识，以滤除退化基元。在大规模基准测试上的实验结果表明，我们的方法在重建质量上显著优于当前最先进的无需姿态NVS方法。AirSplat凸显了适配3DVFMs以实现视觉几何估计与高质量视角合成同步进行的潜力。

Abstract

While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.

Keywords: 3D Vision Foundation Models, novel view synthesis, pose-free NVS, Self-Consistent Pose Alignment, Rating-based Opacity Matching, geometric priors, reconstruction quality, 3D Gaussian Splatting

230. ❌ AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Authors: Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

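Rationale: The paper's core topic is applying multimodal large language models (MLLMs) to document generation, so it is highly relevant to 'Large Language Models' (10). It explicitly fine-tunes MLLMs, which makes it highly relevant to 'Post-training/SFT' (10). It builds the large-scale DocHTML dataset (265,206 samples), which touches on data quality and scale and gives it some relevance to 'Scaling Laws AND Data Quality' (5). The fine-tuning process also gives it some relevance to 'Pre-training/Domain Adaptation' (5). The remaining keywords, such as MoE, SLMs, RLHF, and RAG, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes AnyDoc, a framework that tackles multi-category document generation through large-scale HTML/CSS data synthesis and height-aware reinforcement learning, and that outperforms general-purpose MLLMs and task-specific baselines on three practical tasks.

Abstract (Chinese translation)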

文档生成在人工智能驱动的内容创作领域日益受到关注。本研究通过引入AnyDoc框架拓展了该领域的边界，该框架能够处理广泛文档类别中的多种生成任务，所有文档均以统一的HTML/CSS格式表示。为克服现有人工标注文档数据集覆盖范围有限、规模不足的问题，AnyDoc首先建立了一个可扩展的数据合成流水线，用于自动生成HTML/CSS格式的文档。该流水线产生了DocHTML——一个包含265,206个文档样本的大规模数据集，涵盖111个类别和32种不同样式。此外，所有文档均配备完整的元数据，包括设计意图、HTML/CSS源代码、视觉素材和渲染截图。基于构建的数据集，AnyDoc对多模态大语言模型（MLLMs）进行微调，以实现三种实用的文档生成任务：意图到文档生成、文档逆向解析以及元素到文档生成。针对微调过程中观察到的内容溢出问题，AnyDoc进一步引入了高度感知强化学习（HARL）后训练流程。通过基于预测文档高度与目标高度差异定义奖励函数，HARL过程对溢出进行惩罚并逐步缓解该问题，从而提升整体性能。定性与定量实验表明，AnyDoc在所有三项任务中均优于通用多模态大语言模型及任务专用基线方法。

Abstract

Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
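The abstract says only that the HARL reward is based on the difference between predicted and target document heights. A minimal sketch of one such reward follows. The function name `height_reward`, the linear shape, and the clipping to [0, 1] are assumptions made for illustration and are not taken from the paper.

```python
def height_reward(pred_height: float, target_height: float) -> float:
    """Hypothetical height-aware reward in [0, 1]; not the paper's formula.

    1.0 means the rendered height matches the target exactly. The reward
    falls linearly with the relative height difference and is clipped at 0.
    """
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    relative_gap = abs(pred_height - target_height) / target_height
    return max(0.0, 1.0 - relative_gap)


# Made-up heights in pixels: an exact fit, a 50% overflow, and a 3x overflow.
for pred in (1000, 1500, 3000):
    print(pred, height_reward(pred, 1000))
```

A symmetric penalty is used here for simplicity. A variant that penalizes only overflow (predicted height above target) would fit the same description.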

Keywords: document generation, large-scale data synthesis, HTML/CSS format, multi-modal large language models, fine-tuning, height-aware reinforcement learning, DocHTML dataset, overflow mitigation

231. ❌ MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Authors: Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

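Rationale: The paper's core topic is a reinforcement learning training method for multimodal reward models (MRMs). It is highly relevant to 'RLHF/RLAIF/DPO' (10) because MSRL is essentially reinforcement learning applied to reward modeling. It is relevant to 'Large Language Models/Foundation Models' (8) because reward modeling is typically built on large models, and to 'Post-training/SFT' (8) because it involves model fine-tuning. It has some relevance to 'Scaling Laws/Data Quality' (5) and 'Pre-training/Domain Adaptation' (5), through data quality and domain adaptation, and to 'Instruction Tuning/Alignment' (5) because reward modeling bears on alignment. The remaining keywords, such as MoE, SLMs, RAG, and quantization, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Multi-Stage Reinforcement Learning (MSRL) approach that addresses the scalability problem caused by reliance on costly annotated data when training multimodal reward models, and substantially improves performance on visual understanding and visual generation tasks.

Abstract (Chinese translation)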

多模态奖励建模的最新进展主要由从判别式到生成式方法的范式转变所驱动。基于此进展，近期研究进一步采用可验证奖励的强化学习来增强多模态奖励模型。尽管取得了成功，但基于RLVR的训练通常依赖于标注的多模态偏好数据，这些数据获取成本高昂且劳动密集，使得多模态奖励模型的训练难以扩展。为克服这一限制，我们提出了一种多阶段强化学习方法，该方法能够在多模态数据有限的情况下实现可扩展的多模态奖励模型强化学习。MSRL通过以下方式取代了传统的基于RLVR的训练范式：首先从大规模文本偏好数据中学习可泛化的奖励推理能力，随后通过基于描述的强化学习阶段和完全多模态强化学习阶段，逐步将这种能力迁移到多模态任务中。此外，我们引入了一种跨模态知识蒸馏方法，以提升MSRL内部的偏好泛化能力。大量实验表明，MSRL有效扩展了生成式多模态奖励模型基于RLVR的训练，并在视觉理解和视觉生成任务上显著提升了其性能（例如，在VL-RewardBench上从66.6%提升至75.9%，在GenAI-Bench上从70.2%提升至75.7%），且无需额外的多模态偏好标注。我们的代码公开于：https://github.com/wangclnlp/MSRL。

Abstract

Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.

Keywords: multimodal reward modeling, reinforcement learning, MSRL, scalable training, preference data, cross-modal knowledge distillation, visual understanding, visual generation

232. ❌ On Neural Scaling Laws for Weather Emulation through Continual Training

Authors: Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov, Amir Gholami, Dmitriy Morozov, Michael W. Mahoney Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

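Rationale: The paper's core topic is neural scaling laws in scientific machine learning, specifically weather forecasting. It is highly relevant to 'Scaling Laws AND Data Quality' (10) because it systematically studies how model, data, and compute scale affect performance, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10) because it focuses on weather prediction as a scientific ML application. It is relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8) because it adopts a continual training strategy. It is moderately relevant to 'Large Language Models OR LLMs OR Foundation Models' (5) because it borrows the scaling-law concept from foundation models without studying LLMs directly. The remaining keywords are unrelated (0): the paper is about scaling laws for scientific ML, not LLM-specific techniques such as MoE, SFT, or RAG.

!!! tip deepseek-chat TL;DR

The paper studies neural scaling laws in scientific machine learning, shows that weather forecasting models trained with a continual training strategy follow predictable scaling trends, and identifies compute-optimal training regimes.

Abstract (Chinese translation)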

神经缩放定律作为自然语言处理与计算机视觉领域构建基础模型的基石，在某些领域能够依据模型规模、数据量和计算量预测大型神经网络的性能表现。本研究聚焦于科学机器学习中的神经缩放现象，并以天气预报模型为具体研究对象。为在尽可能简化的场景中分析缩放规律，我们采用了一种极简、可扩展且通用的Swin Transformer架构，并运用恒定学习率配合周期性冷却的持续训练策略作为高效训练方法。研究表明，以此极简方式训练的模型遵循可预测的缩放趋势，其表现甚至优于标准余弦学习率调度方案。冷却阶段可被重新用于提升下游任务性能，例如通过谱损失调整实现更精准的多步长时序推演预报以及更锐化的预测结果。我们系统探索了不同计算预算下多种模型与数据集规模的组合，构建了等计算量曲线，并确定了计算最优的训练方案。将这些趋势外推至更大规模时，研究揭示了潜在的性能极限，证明神经缩放定律可作为资源高效配置的重要诊断工具。我们已开源代码以确保结果的可复现性。

Abstract

Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strategy. We show that models trained in this minimalist way follow predictable scaling trends and even outperform standard cosine learning rate schedules. Cooldown phases can be re-purposed to improve downstream performance, e.g., enabling accurate multi-step rollouts over longer forecast horizons as well as sharper predictions through spectral loss adjustments. We also systematically explore a wide range of model and dataset sizes under various compute budgets to construct IsoFLOP curves, and we identify compute-optimal training regimes. Extrapolating these trends to larger scales highlights potential performance limits, demonstrating that neural scaling can serve as an important diagnostic for efficient resource allocation. We open-source our code for reproducibility.
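A common way to read a compute-optimal model size off an IsoFLOP curve is to fit a parabola to loss against log model size at a fixed compute budget and take its minimum. The sketch below illustrates that generic procedure. It is not the paper's code, the names `fit_quadratic` and `compute_optimal_size` are hypothetical, and the data points are synthetic.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # power sums
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 normal equations for the unknowns (a, b, c).
    rows = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(rows[r][i]))
        rows[i], rows[pivot] = rows[pivot], rows[i]
        for r in range(3):
            if r != i:
                factor = rows[r][i] / rows[i][i]
                rows[r] = [u - factor * v for u, v in zip(rows[r], rows[i])]
    return tuple(rows[i][3] / rows[i][i] for i in range(3))


def compute_optimal_size(model_sizes, losses):
    """Model size at the minimum of a parabola fitted in log10(size)."""
    logs = [math.log10(n) for n in model_sizes]
    mean = sum(logs) / len(logs)
    centered = [x - mean for x in logs]        # centering improves conditioning
    a, b, _ = fit_quadratic(centered, losses)
    if a <= 0:
        raise ValueError("fitted curve has no minimum")
    return 10 ** (mean - b / (2 * a))


# Synthetic IsoFLOP curve whose minimum sits at 1e8 parameters.
sizes = [1e7, 3e7, 1e8, 3e8, 1e9]
losses = [0.1 * (math.log10(n) - 8.0) ** 2 + 2.0 for n in sizes]
print(compute_optimal_size(sizes, losses))
```

Repeating the fit for several compute budgets gives one optimum per budget, and those optima can then be extrapolated to larger scales.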

Keywords: Neural Scaling Laws, Weather Forecasting, Scientific Machine Learning, Continual Training, Swin Transformer, IsoFLOP Curves, Compute-optimal Training, Foundation Models

233. ❌ Longitudinal Digital Phenotyping for Early Cognitive-Motor Screening

Authors: Diego Jimenez-Oviedo, Ruben Vera-Rodriguez, Ruben Tolosana, Juan Carlos Ruiz-Garcia, Jaime Herreros-Rodriguez Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

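Rationale: The paper studies longitudinal digital phenotyping of children's cognitive-motor development, applying unsupervised learning (t-SNE, K-Means++) to tablet interaction data. This is an application of AI to science (specifically pediatrics and developmental science), so it has some relevance only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5): that keyword covers AI applications in scientific fields, including bioinformatics, and pediatric and developmental science counts as science in the broad sense. All other keywords concern large models, deep learning principles, training methods, inference optimization, agent systems, and similar topics. The paper involves no large models or deep learning and uses only classical machine learning for cluster analysis, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes an AI-driven longitudinal framework that applies unsupervised learning to tablet interaction data to identify and track cognitive-motor developmental phenotypes in children aged 18 months to 8 years. It finds three performance phenotypes (low, medium, and high), shows that the low-performance phenotype is highly stable in the early years, and provides a data-driven basis for early screening and intervention.

Abstract (Chinese translation)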

非典型认知运动发育的早期发现对于及时干预至关重要，然而传统评估方法严重依赖主观、静态的评价。数字设备的整合为通过数字生物标志物进行连续、客观的监测提供了机会。在本研究中，我们提出了一个由人工智能驱动的纵向框架，用于建模18个月至8岁儿童的发育轨迹。利用在多个学年期间收集的基于平板电脑交互的数据集，我们分析了六项认知运动任务（例如精细运动控制、反应时间）。我们应用降维技术（t-SNE）和无监督聚类（K-Means++）来识别不同的发育表型，并追踪个体在这些特征剖面之间随时间的变化。我们的分析揭示了三种不同的特征剖面：低、中、高表现水平。关键的是，纵向追踪突显出低表现水平聚类具有高度稳定性（在早期阶段保留率>90%），这表明早期缺陷若无干预往往会持续存在。相反，较高表现水平的聚类显示出更大的变异性，这可能反映了参与度因素的影响。本研究验证了在触摸屏数据上应用无监督学习以揭示异质性发育路径的可行性。所识别的特征剖面可作为认知发展的可扩展、数据驱动的代理指标，为早期筛查工具和个性化儿科干预奠定了基础。

Abstract

Early detection of atypical cognitive-motor development is critical for timely intervention, yet traditional assessments rely heavily on subjective, static evaluations. The integration of digital devices offers an opportunity for continuous, objective monitoring through digital biomarkers. In this work, we propose an AI-driven longitudinal framework to model developmental trajectories in children aged 18 months to 8 years. Using a dataset of tablet-based interactions collected over multiple academic years, we analyzed six cognitive-motor tasks (e.g., fine motor control, reaction time). We applied dimensionality reduction (t-SNE) and unsupervised clustering (K-Means++) to identify distinct developmental phenotypes and tracked individual transitions between these profiles over time. Our analysis reveals three distinct profiles: low, medium, and high performance. Crucially, longitudinal tracking highlights a high stability in the low-performance cluster (>90% retention in early years), suggesting that early deficits tend to persist without intervention. Conversely, higher-performance clusters show greater variability, potentially reflecting engagement factors. This study validates the use of unsupervised learning on touchscreen data to uncover heterogeneous developmental paths. The identified profiles serve as scalable, data-driven proxies for cognitive growth, offering a foundation for early screening tools and personalized pediatric interventions.
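The retention figure quoted in the abstract (over 90% for the low-performance cluster) is the share of children who stay in the same cluster from one assessment to the next. A minimal sketch of that calculation follows, assuming cluster labels are already available for two consecutive years. The function name `retention_rates` and the example labels are made up for illustration.

```python
from collections import Counter


def retention_rates(labels_before, labels_after):
    """Per-cluster fraction of individuals whose cluster label is unchanged.

    labels_before[i] and labels_after[i] are the cluster labels of the same
    child at two consecutive assessments.
    """
    if len(labels_before) != len(labels_after):
        raise ValueError("both label lists must cover the same individuals")
    totals = Counter(labels_before)
    stayed = Counter(b for b, a in zip(labels_before, labels_after) if a == b)
    return {cluster: stayed[cluster] / totals[cluster] for cluster in totals}


# Made-up labels for 14 children across two school years.
year_1 = ["low"] * 10 + ["high"] * 4
year_2 = ["low"] * 9 + ["medium"] + ["high", "high", "medium", "medium"]
print(retention_rates(year_1, year_2))
```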

Keywords: longitudinal digital phenotyping, cognitive-motor development, early screening, unsupervised learning, developmental trajectories, tablet-based interactions, K-Means++, pediatric interventions

234. ❌ Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

Authors: John Ayotunde, Qinghua Xu, Guancheng Wang, Lionel C. Briand Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25670v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

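Rationale: The paper addresses class imbalance in CPS safety monitoring and proposes a label rebalancing method based on behavioral uncertainty. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models and related techniques, whereas the paper applies a conventional machine learning model (GatedMLP) to time-series data. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to a science and engineering domain (CPS safety monitoring), which falls under AI for Science in the broad sense but is not a core bioinformatics or cheminformatics application, hence a score of 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper targets the extreme class imbalance caused by the rarity of unsafe events in Cyber-Physical Systems (CPS) safety monitoring. It proposes U-Balance, which uses behavioral uncertainty to guide label rebalancing and substantially improves the safety predictor (an F1 score of 0.806, 14.3 percentage points above the strongest baseline).

Abstract (Chinese translation)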

安全监控对于信息物理系统（Cyber-Physical Systems, CPS）至关重要。然而，在实际CPS运行中，不安全事件极为罕见，这导致了极端的类别不平衡问题，从而降低了安全预测器的性能。标准的再平衡技术在处理时间序列CPS遥测数据时表现不佳，它们要么生成不切实际的合成样本，要么对少数类别过拟合。与此同时，CPS运行中的行为不确定性——即CPS决策中的怀疑或不确定程度——常与安全结果相关，但在安全监控领域尚未得到充分探索。为此，我们提出了U-Balance，一种监督式方法，它利用行为不确定性在训练安全预测器之前对不平衡数据集进行再平衡。U-Balance首先训练一个基于门控多层感知器（GatedMLP）的不确定性预测器，该预测器将每个遥测数据窗口概括为分布运动学特征，并输出一个不确定性分数。随后，它应用一种不确定性引导的标签再平衡机制（uLNR），该机制以概率方式将标记为“安全”但具有异常高不确定性的窗口重新标记为“不安全”，从而在不合成新数据的情况下，通过信息丰富的边界样本丰富少数类别。最后，在再平衡后的数据集上训练安全预测器以进行安全监控。我们在一个安全与不安全事件比例为46:1的大规模无人机（UAV）基准测试上评估了U-Balance。结果证实了行为不确定性与安全性之间存在适度但显著的相关性。与直接的早期融合和晚期融合策略相比，我们进一步确定uLNR是利用不确定性信息最有效的策略。U-Balance实现了0.806的F1分数，比最强基线高出14.3个百分点，同时保持了有竞争力的推理效率。消融研究证实，基于GatedMLP的不确定性预测器和uLNR机制都对U-Balance的有效性有显著贡献。

Abstract

Safety monitoring is essential for Cyber-Physical Systems (CPSs). However, unsafe events are rare in real-world CPS operations, creating an extreme class imbalance that degrades safety predictors. Standard rebalancing techniques perform poorly on time-series CPS telemetry, either generating unrealistic synthetic samples or overfitting on the minority class. Meanwhile, behavioral uncertainty in CPS operations, defined as the degree of doubt or uncertainty in CPS decisions, is often correlated with safety outcomes but unexplored in safety monitoring. To that end, we propose U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance imbalanced datasets prior to training a safety predictor. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes each telemetry window into distributional kinematic features and outputs an uncertainty score. It then applies an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels "safe"-labeled windows with unusually high uncertainty as "unsafe", thereby enriching the minority class with informative boundary samples without synthesizing new data. Finally, a safety predictor is trained on the rebalanced dataset for safety monitoring. We evaluate U-Balance on a large-scale UAV benchmark with a 46:1 safe-to-unsafe ratio. Results confirm a moderate but significant correlation between behavioral uncertainty and safety. We then identify uLNR as the most effective strategy to exploit uncertainty information, compared to direct early and late fusion. U-Balance achieves a 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points, while maintaining competitive inference efficiency. Ablation studies confirm that both the GatedMLP-based uncertainty predictor and the uLNR mechanism contribute significantly to U-Balance’s effectiveness.
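The relabeling step described in the abstract can be sketched as follows. This is an illustration only: the abstract does not say how "unusually high" uncertainty is defined or how the relabeling probability is chosen, so the fixed `threshold`, the constant `flip_prob`, and the function name `relabel_uncertain_safe` are all assumptions.

```python
import random


def relabel_uncertain_safe(labels, uncertainty, threshold, flip_prob, seed=0):
    """Probabilistically flip high-uncertainty safe windows to unsafe.

    labels: 0 = safe, 1 = unsafe, one entry per telemetry window.
    uncertainty: uncertainty score per window, same length as labels.
    Windows labeled safe whose score exceeds `threshold` become unsafe with
    probability `flip_prob`. Unsafe windows are never changed and no new
    samples are synthesized.
    """
    if len(labels) != len(uncertainty):
        raise ValueError("labels and uncertainty must have the same length")
    rng = random.Random(seed)  # seeded so the relabeling is reproducible
    relabeled = []
    for label, score in zip(labels, uncertainty):
        is_candidate = label == 0 and score > threshold
        if is_candidate and rng.random() < flip_prob:
            relabeled.append(1)
        else:
            relabeled.append(label)
    return relabeled


# Made-up windows: two safe windows have high uncertainty scores.
labels = [0, 0, 0, 1, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.3]
print(relabel_uncertain_safe(labels, scores, threshold=0.5, flip_prob=0.7))
```

The safety predictor would then be trained on the relabeled windows, leaving the input features untouched.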

Keywords: Cyber-Physical Systems, Safety Monitoring, Class Imbalance, Behavioral Uncertainty, Label Rebalancing, UAV Benchmark, GatedMLP, F1 Score

235. ❌ Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments

Authors: Armand de Villeroché, Rem-Sophia Mouradi, Vincent Le Guen, Sibo Cheng, Marc Bocquet, Alban Farchi, Patrick Armand, Patrick Massin Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25635v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

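Rationale: The paper studies AB-SWIFT, a Transformer-based deep learning model for 3D atmospheric flow in urban environments. This is an application of AI to science (atmospheric and environmental science), so it has some relevance only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5). It is unrelated to all other keywords, which concern large-model principles, training methods, inference optimization, alignment, agent systems, and similar topics (0).

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty deep learning models have in adapting to complex geometries and large mesh sizes in urban atmospheric flow modeling. It proposes the Transformer-based Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT), trains it on simulations around randomised urban geometries under several atmospheric stratifications, and achieves the best prediction accuracy compared with existing Transformer and graph-based models.

Abstract (Chinese translation)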

局部尺度气流建模对于污染物扩散模拟或风电场建模等应用至关重要。为规避昂贵的计算流体动力学（CFD）模拟，深度学习代理模型近年来已成为具有前景的替代方案。然而，在城市气流模拟领域，深度学习模型难以适应城市几何结构的高度变化及大规模网格尺寸。为应对这些挑战，我们提出了锚定分支稳态风流变换器（Anchored Branched Steady-state WInd Flow Transformer, AB-SWIFT），这是一种基于变换器架构、具有内部分支结构的模型，专为大气流动建模而设计。我们在一个特别构建的数据库上训练模型，该数据库包含随机城市几何结构周围的大气模拟数据，并混合了不稳定、中性和稳定三种大气层结条件。与当前最先进的变换器模型及基于图结构的模型相比，我们的模型在所有预测场上均达到了最佳精度。代码与数据公开于 https://github.com/cerea-daml/abswift。

Abstract

Air flow modeling at a local scale is essential for applications such as pollutant dispersion modeling or wind farm modeling. To circumvent costly Computational Fluid Dynamics (CFD) computations, deep learning surrogate models have recently emerged as promising alternatives. However, in the context of urban air flow, deep learning models struggle to adapt to the high variations of the urban geometry and to large mesh sizes. To tackle these challenges, we introduce Anchored Branched Steady-state WInd Flow Transformer (AB-SWIFT), a transformer-based model with an internal branched structure uniquely designed for atmospheric flow modeling. We train our model on a specially designed database of atmospheric simulations around randomised urban geometries and with a mixture of unstable, neutral, and stable atmospheric stratifications. Our model reaches the best accuracy on all predicted fields compared to state-of-the-art transformers and graph-based models. Our code and data is available at https://github.com/cerea-daml/abswift.

Keywords: atmospheric flow modeling, urban environments, transformer-based model, deep learning surrogate, Computational Fluid Dynamics, branched structure, wind flow, air flow

236. ❌ The Geometry of Efficient Nonconvex Sampling

Authors: Santosh S. Vempala, Andre Wibisono Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

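Rationale: 'The Geometry of Efficient Nonconvex Sampling' studies algorithms for uniform sampling from nonconvex sets in high-dimensional space, which belongs to computational geometry and sampling theory. Every scoring keyword concerns large models, deep learning, AI applications, or the underlying techniques, none of which the paper touches, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents an efficient algorithm for uniform sampling from a warm start in an arbitrary compact nonconvex set that satisfies isoperimetry and a volume growth condition. Its complexity is polynomial in the dimension, the Poincaré constant, and the volume growth constant.

Abstract (Chinese translation)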

我们提出一种高效算法，用于在等周性与自然体积增长条件下，从任意紧致体 $\mathcal{X} \subset \mathbb{R}^n$ 中实现基于热启动的均匀采样。我们的结果对凸体与星形体已知结论进行了重要的共同推广。该算法的复杂度在维度、$\mathcal{X}$ 上均匀分布的庞加莱常数以及集合 $\mathcal{X}$ 的体积增长常数方面均为多项式级别。

Abstract

We present an efficient algorithm for uniformly sampling from an arbitrary compact body $\mathcal{X} \subset \mathbb{R}^n$ from a warm start under isoperimetry and a natural volume growth condition. Our result provides a substantial common generalization of known results for convex bodies and star-shaped bodies. The complexity of the algorithm is polynomial in the dimension, the Poincaré constant of the uniform distribution on $\mathcal{X}$ and the volume growth constant of the set $\mathcal{X}$.

Keywords: nonconvex sampling, uniform sampling, compact body, isoperimetry, volume growth, Poincaré constant, warm start, efficient algorithm

Authors: Liping Yi, Zhiming Zhao, Qinghua Hu Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25614v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

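Rationale: The paper proposes the SoHip framework, which belongs to social machine learning (SML) and federated learning (FL). Its core idea is to let heterogeneous agents collaborate through memory sharing rather than model sharing. It is unrelated to most large-model keywords. It is moderately relevant to 'Small Language Models OR SLMs OR On-device AI' (5) because the framework emphasizes local models and on-device processing, and to 'Multi-agent Systems OR Agent Coordination' (5) because it involves collaborative learning among multiple agents. The remaining keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The study proposes SoHip, a memory-sharing social machine learning framework that abstracts and fuses individual and collective long-term memory to enhance the local predictions of heterogeneous agents, achieving higher accuracy than existing methods while preserving data privacy.

Abstract (Chinese translation)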

社会学习强调，智能体的能力提升并非孤立进行，而是通过与他人的互动及结构化知识交换实现。将这一原则引入机器学习领域，便催生了社会机器学习（SML），即多个智能体通过共享抽象知识进行协作学习。联邦学习（FL）为此范式提供了天然的协作基础，然而现有的异构联邦学习方法通常依赖于共享模型参数或中间表征，这可能暴露敏感信息并产生额外开销。本研究提出SoHip（社会海马体记忆学习），一种以记忆为中心的社会机器学习框架，使异构智能体能够通过共享记忆而非共享模型进行协作。SoHip从局部表征中抽象出每个智能体的个体短期记忆，通过受海马体启发的机制将其巩固为个体长期记忆，并与集体聚合的长期记忆融合以增强本地预测。在整个过程中，原始数据与本地模型始终保留在设备端，仅交换轻量级的记忆单元。我们对其收敛性和隐私保护特性进行了理论分析。在两个基准数据集上使用七种基线方法的实验表明，SoHip始终优于现有方法，最高可实现8.78%的准确率提升。

Abstract

Social learning highlights that learning agents improve not in isolation, but through interaction and structured knowledge exchange with others. When introduced into machine learning, this principle gives rise to social machine learning (SML), where multiple agents collaboratively learn by sharing abstracted knowledge. Federated learning (FL) provides a natural collaboration substrate for this paradigm, yet existing heterogeneous FL approaches often rely on sharing model parameters or intermediate representations, which may expose sensitive information and incur additional overhead. In this work, we propose SoHip (Social Hippocampus Memory Learning), a memory-centric social machine learning framework that enables collaboration among heterogeneous agents via memory sharing rather than model sharing. SoHip abstracts each agent’s individual short-term memory from local representations, consolidates it into individual long-term memory through a hippocampus-inspired mechanism, and fuses it with collectively aggregated long-term memory to enhance local prediction. Throughout the process, raw data and local models remain on-device, while only lightweight memory are exchanged. We provide theoretical analysis on convergence and privacy preservation properties. Experiments on two benchmark datasets with seven baselines demonstrate that SoHip consistently outperforms existing methods, achieving up to 8.78% accuracy improvements.

Keywords: Social Machine Learning, Federated Learning, Memory Sharing, Heterogeneous Agents, On-device Learning, Privacy Preservation, Collaborative Learning, Hippocampus-inspired Mechanism

238. ❌ Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder

Authors: Kewei Zhu, Yanze Xin, Jinwei Hu, Xiaoyuan Cheng, Yiming Yang, Sibo Cheng Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

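Scoring rationale: The paper proposes a Physics-Spatiotemporal Masked Autoencoder for forecasting spatiotemporal systems with irregular time steps, an application of deep learning to scientific computing. It is entirely unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It has some relevance only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the method targets scientific domains such as climate modelling, fluid dynamics, and ocean forecasting and therefore falls under AI for Science. Since it is not a core bioinformatics or cheminformatics application, it receives 5 points.

!!! tip deepseek-chat TL;DR

The paper proposes a Physics-Spatiotemporal Masked Autoencoder for forecasting high-dimensional dynamical systems with irregular time steps, and reports significant gains in prediction accuracy, robustness, and computational efficiency on simulated and real-world ocean temperature data.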

Abstract

Predicting high-dimensional dynamical systems with irregular time steps presents significant challenges for current data-driven algorithms. These irregularities arise from missing data, sparse observations, or adaptive computational techniques, reducing prediction accuracy. To address these limitations, we propose a novel method: a Physics-Spatiotemporal Masked Autoencoder. This method integrates convolutional autoencoders for spatial feature extraction with masked autoencoders optimised for irregular time series, leveraging attention mechanisms to reconstruct the entire physical sequence in a single prediction pass. The model avoids the need for data imputation while preserving physical integrity of the system. Here, ‘physics’ refers to high-dimensional fields generated by underlying dynamical systems, rather than the enforcement of explicit physical constraints or PDE residuals. We evaluate this approach on multiple simulated datasets and real-world ocean temperature data. The results demonstrate that our method achieves significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency over traditional convolutional and recurrent network methods. The model shows potential for capturing complex spatiotemporal patterns without requiring domain-specific knowledge, with applications in climate modelling, fluid dynamics, ocean forecasting, environmental monitoring, and scientific computing.

Keywords: Spatiotemporal forecasting, Irregular time steps, Masked autoencoder, Physics-informed, High-dimensional dynamical systems, Ocean temperature prediction, Attention mechanisms, Computational efficiency

239. ❌ The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks

Authors: Gabriele Farné, Fabrizio Boncoraglio, Lenka Zdeborová Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

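Scoring rationale: The paper studies a theoretical model (the RAF model) of how neural networks simultaneously learn rules and memorize facts, at the intersection of machine learning theory, statistical physics, and learning theory. All scoring keywords focus on practical aspects of large models (LLMs): specific techniques, applications, training methods, inference optimization, alignment, and deployment. This paper is a foundational theoretical analysis and does not involve any specific large-model architecture, training technique, application scenario, or optimization method. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces the Rules-and-Facts (RAF) model, a minimal solvable theoretical model for analyzing and quantifying how neural networks simultaneously generalize (learn underlying rules) and memorize (store specific facts or exceptions), and it reveals the roles that overparameterization, regularization, and kernel choice play in this process.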

Abstract

A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction $1 - \varepsilon$ of training labels is generated by a structured teacher rule, while a fraction $\varepsilon$ consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule - allowing generalization to new data - and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.

Keywords: neural networks, generalization, memorization, teacher-student framework, statistical physics of learning, overparameterization, capacity analysis, rule learning
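
The data model in the abstract above (a fraction $1 - \varepsilon$ of labels produced by a teacher rule, a fraction $\varepsilon$ of random "facts") can be illustrated with a small synthetic dataset. This is only a sketch under stated assumptions, not the paper's construction: the Gaussian inputs, the sign-perceptron teacher, and the names `make_raf_dataset`, `is_fact`, and `teacher` are all hypothetical choices made for illustration.

```python
import numpy as np

def make_raf_dataset(n, d, eps, seed=0):
    """Toy rules-and-facts data: (1 - eps) rule labels, eps random labels.

    Assumptions (not from the paper): Gaussian inputs and a linear
    teacher whose rule is the sign of a projection.
    """
    rng = np.random.default_rng(seed)
    teacher = rng.standard_normal(d)      # hypothetical teacher weights
    X = rng.standard_normal((n, d))       # i.i.d. Gaussian inputs
    y = np.sign(X @ teacher)              # structured "rule" labels

    # Overwrite an eps fraction of labels with unstructured random "facts".
    n_facts = int(round(eps * n))
    fact_idx = rng.choice(n, size=n_facts, replace=False)
    y[fact_idx] = rng.choice([-1.0, 1.0], size=n_facts)

    is_fact = np.zeros(n, dtype=bool)
    is_fact[fact_idx] = True
    return X, y, is_fact, teacher

X, y, is_fact, teacher = make_raf_dataset(n=2000, d=50, eps=0.2)
rule_labels = np.sign(X @ teacher)

# Rule examples agree with the teacher by construction; fact examples
# agree only by chance (about half the time).
rule_agreement = np.mean(y[~is_fact] == rule_labels[~is_fact])
fact_agreement = np.mean(y[is_fact] == rule_labels[is_fact])
print(rule_agreement, fact_agreement)
```

A learner that fits all of `y` must both recover `teacher` (to generalize) and store the labels at `is_fact` positions (to memorize), which is the tension the abstract describes.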

240. ❌ Cooperative Deep Reinforcement Learning for Fair RIS Allocation

Authors: Martin Mark Zan, Stefan Schwarz Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25572v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

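Scoring rationale: The paper studies fair allocation of reconfigurable intelligent surfaces (RISs) in wireless communication networks using multi-agent reinforcement learning (MARL). It is entirely unrelated to the vast majority of keywords (large models, deep learning principles, AI for Science, and so on). The only relevant keyword is 'Multi-agent Systems OR Agent Coordination', because the paper explicitly uses 'collaborative multi-agent reinforcement learning' and 'cooperative learning'. This is not the paper's core contribution, which is the fair allocation mechanism, so it receives 5 points (some relevance). All other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a fairness-aware framework based on cooperative multi-agent reinforcement learning for dynamically allocating reconfigurable intelligent surface resources in multi-cell wireless networks, improving the rates of the worst-served users while preserving overall throughput.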

Abstract

The deployment of reconfigurable intelligent surfaces (RISs) introduces new challenges for resource allocation in multi-cell wireless networks, particularly when user loads are uneven across base stations. In this work, we consider RISs as shared infrastructure that must be dynamically assigned among competing base stations, and we address this problem using a simultaneous ascending auction mechanism. To mitigate performance imbalances between cells, we propose a fairness-aware collaborative multi-agent reinforcement learning approach in which base stations adapt their bidding strategies based on both expected utility gains and relative service quality. A centrally computed performance-dependent fairness indicator is incorporated into the agents’ observations, enabling implicit coordination without direct inter-base-station communication. Simulation results show that the proposed framework effectively redistributes RIS resources toward weaker-performing cells, substantially improving the rates of the worst-served users while preserving overall throughput. The results demonstrate that fairness-oriented RIS allocation can be achieved through cooperative learning, providing a flexible tool for balancing efficiency and equity in future wireless networks.

Keywords: reconfigurable intelligent surfaces, resource allocation, multi-cell wireless networks, multi-agent reinforcement learning, fairness, cooperative learning, simultaneous ascending auction, wireless networks

241. ❌ An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae

Authors: Neha K. Nair, Aaron D'Souza Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

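Scoring rationale: The paper uses conventional machine learning methods (Random Forest, XGBoost, a variational autoencoder, a generative adversarial network) together with a genome-scale metabolic model to predict and optimize biomass production in yeast, placing it in bioinformatics / AI for Science. It involves no large language model (LLM) and no innovation in deep learning principles, and it mentions none of the large-model techniques among the scoring keywords (pre-training, fine-tuning, inference optimization, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper explicitly applies machine learning to a bioinformatics problem (yeast metabolic engineering), so it receives 10 points (highly relevant). The paper is unrelated to all other keywords, which score 0.

!!! tip deepseek-chat TL;DR

The study develops a computational framework that combines a genome-scale metabolic model with machine learning (Random Forest, XGBoost, a generative adversarial network, and others) to predict and optimize biomass flux in Saccharomyces cerevisiae, achieving up to a 12-fold increase.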

Abstract

Saccharomyces cerevisiae is a cornerstone organism in industrial biotechnology, valued for its genetic tractability and robust fermentative capacity. Accurately predicting biomass flux across diverse environmental and genetic perturbations remains a significant challenge for rational strain design. We present a computational framework combining the Yeast9 genome-scale metabolic model with machine learning and optimization to predict, interpret, and enhance biomass flux. Flux balance analysis generated 2,000 flux profiles by varying glucose, oxygen, and ammonium uptake rates. Random Forest and XGBoost regressors achieved $R^2$ of 0.99989 and 0.9990, respectively. A variational autoencoder revealed four distinct metabolic clusters, and SHAP analysis identified glycolysis, the TCA cycle, and lipid biosynthesis as key biomass determinants. In silico overexpression achieved a biomass flux of 0.979 gDW/hr, while Bayesian optimization of nutrient constraints produced a 12-fold increase (0.0858 to 1.041 gDW/hr). A generative adversarial network proposed stoichiometrically feasible novel flux configurations. This framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering.

Keywords: genome-scale metabolic modeling, machine learning, Saccharomyces cerevisiae, biomass production, flux balance analysis, generative adversarial network, metabolic engineering, bioinformatics
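
Flux balance analysis, which the abstract above uses to generate flux profiles, is at heart a linear program: maximize a biomass flux subject to steady-state mass balance $S v = 0$ and bounds on each flux. The sketch below shows this on a made-up three-reaction network. The network, its bounds, and all variable names are hypothetical and have nothing to do with the Yeast9 model. The last lines simply check the fold increase quoted in the abstract.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix (hypothetical network, not Yeast9).
# Rows: metabolites A, B.
# Columns: uptake (-> A), conversion (A -> B), biomass (B ->).
S = np.array([
    [1.0, -1.0,  0.0],
    [0.0,  1.0, -1.0],
])

# Flux bounds: uptake is capped at 10, the other reactions are
# effectively unconstrained.
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0)]

# linprog minimizes, so maximize biomass by minimizing its negative.
c = np.array([0.0, 0.0, -1.0])

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
biomass_flux = res.x[2]
print("optimal biomass flux:", biomass_flux)

# Arithmetic check of the reported improvement (values from the abstract).
fold_increase = 1.041 / 0.0858
print("fold increase:", round(fold_increase, 2))
```

In this toy network the uptake bound is the only binding constraint, so varying it (as the paper does with glucose, oxygen, and ammonium uptake rates) directly changes the optimal biomass flux.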

242. ❌ Missing-Aware Multimodal Fusion for Unified Microservice Incident Management

Authors: Wenzhuo Qian, Hailiang Zhao, Ziqi Wang, Zhipeng Gao, Jiayi Chen, Zhiwei Ling, Shuiguang Deng Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

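Scoring rationale: The paper focuses on a multimodal fusion framework for microservice incident management that handles missing data using self-supervised learning and modality-specific encoders. All keywords concern large models, deep learning principles, or AI-for-Science applications, whereas this paper applies conventional machine learning / deep learning to operations. It does not involve large-model techniques, large-model training methods, inference optimization, alignment, agent systems, or AI-for-Science applications. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ARMOR, a robust self-supervised framework for handling missing modalities in microservice incident management. Using modality-specific encoders and a missing-aware gated fusion mechanism, it achieves state-of-the-art anomaly detection, failure triage, and root cause localization with complete data and maintains robust diagnostic accuracy under severe modality loss.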

Abstract

Automated incident management is critical for microservice reliability. While recent unified frameworks leverage multimodal data for joint optimization, they unrealistically assume perfect data completeness. In practice, network fluctuations and agent failures frequently cause missing modalities. Existing approaches relying on static placeholders introduce imputation noise that masks anomalies and degrades performance. To address this, we propose ARMOR, a robust self-supervised framework designed for missing modality scenarios. ARMOR features: (i) a modality-specific asymmetric encoder that isolates distribution disparities among metrics, logs, and traces; and (ii) a missing-aware gated fusion mechanism utilizing learnable placeholders and dynamic bias compensation to prevent cross-modal interference from incomplete inputs. By employing self-supervised auto-regression with mask-guided reconstruction, ARMOR jointly optimizes anomaly detection (AD), failure triage (FT), and root cause localization (RCL). AD and RCL require no fault labels, while FT relies solely on failure-type annotations for the downstream classifier. Extensive experiments demonstrate that ARMOR achieves state-of-the-art performance under complete data conditions and maintains robust diagnostic accuracy even with severe modality loss.

Keywords: microservice incident management, missing modalities, multimodal fusion, self-supervised learning, anomaly detection, root cause localization, robust framework, ARMOR

243. ❌ Conformal Prediction for Nonparametric Instrumental Regression

Authors: Masahiro Kato Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25509v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

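Scoring rationale: The paper studies conformal prediction for nonparametric instrumental variable regression, which belongs to statistics and econometrics, and has no direct connection to any of the keywords on large models, deep learning, or AI applications. It does not involve language models, model training, inference optimization, AI agents, or AI-for-Science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a method for constructing distribution-free prediction intervals with finite-sample coverage guarantees in nonparametric instrumental variable regression, and establishes the corresponding theoretical coverage guarantees.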

Abstract

We propose a method for constructing distribution-free prediction intervals in nonparametric instrumental variable regression (NPIV), with finite-sample coverage guarantees. Building on the conditional guarantee framework in conformal inference, we reformulate conditional coverage as marginal coverage over a class of IV shifts $\mathcal{F}$. Our method can be combined with any NPIV estimator, including sieve 2SLS and other machine-learning-based NPIV methods such as neural networks minimax approaches. Our theoretical analysis establishes distribution-free, finite-sample coverage over a practitioner-chosen class of IV shifts.

Keywords: conformal prediction, nonparametric instrumental regression, prediction intervals, finite-sample coverage, IV shifts, NPIV estimator, sieve 2SLS, machine learning
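
For readers unfamiliar with conformal inference, the following sketch shows standard split conformal prediction for ordinary regression. It is deliberately not the paper's NPIV procedure: there are no instruments and no class of IV shifts here, only the basic marginal-coverage recipe that such methods build on. The synthetic data, the linear model, and all names are assumptions made for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Synthetic regression data (hypothetical): y = 1.5 x + noise."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 1.5 * x + rng.standard_normal(n)
    return x, y

x_train, y_train = sample(500)   # used to fit the model
x_cal, y_cal = sample(500)       # held out for calibration
x_test, y_test = sample(2000)    # used to check coverage

# Any point predictor works; here, a least-squares line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x):
    return slope * x + intercept

# Conformity scores: absolute residuals on the calibration set.
alpha = 0.1
scores = np.sort(np.abs(y_cal - predict(x_cal)))

# Finite-sample quantile: the ceil((n + 1)(1 - alpha))-th smallest score.
k = math.ceil((len(scores) + 1) * (1 - alpha))
q_hat = scores[k - 1]

# Prediction interval for a new x: [predict(x) - q_hat, predict(x) + q_hat].
covered = np.abs(y_test - predict(x_test)) <= q_hat
coverage = covered.mean()
print("interval half-width:", q_hat, "empirical coverage:", coverage)
```

Under exchangeability of calibration and test points, intervals built this way cover the true response with probability at least $1 - \alpha$ marginally. The abstract's contribution concerns the harder NPIV setting, where coverage is required over a class of IV shifts.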

244. ❌ How Class Ontology and Data Scale Affect Audio Transfer Learning

Authors: Manuel Milling, Andreas Triantafyllopoulos, Alexander Gebhard, Simon Rampp, Björn W. Schuller Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

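Scoring rationale: The paper studies audio-to-audio transfer learning, focusing on how pre-training data scale (number of samples and number of classes) and task similarity affect transfer. Although it touches on the deep learning concepts of pre-training and fine-tuning, all keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), whereas this paper studies convolutional neural networks or audio-specific models in the audio domain and involves no LLM technique, architecture, or application. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the number of samples and classes in the pre-training data, and the similarity between pre-training and downstream tasks, affect downstream performance in audio transfer learning. It finds that more pre-training samples and more classes both help transfer, but task similarity generally yields a larger gain.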

Abstract

Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that extent, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.

Keywords: transfer learning, audio-to-audio, pre-training, fine-tuning, AudioSet, acoustic scene recognition, bird activity recognition, speech command recognition

245. ❌ Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure

Authors: Benjamin Redden, Hui Wang, Shuyan Li Source: arXiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

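Scoring rationale: The paper focuses on an interpretability method for time-series forecasting models (the Causal-INSIGHT framework) and is entirely unrelated to most large-model keywords (LLM, MoE, RLHF, RAG, and so on). The only strongly relevant keyword is 'Mechanistic Interpretability OR Explainable AI' (10 points), because the paper's core contribution is a model interpretation method. 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) is weakly relevant, because the paper is applied to time-series data from scientific domains (such as bioinformatics), but this is not central. No other keyword has a direct connection.

!!! tip deepseek-chat TL;DR

The paper proposes Causal-INSIGHT, a framework for extracting model-implied, directed, time-lagged influence structure from pre-trained time-series predictors. Using intervention-inspired input clamping and sparsity-aware graph selection, it improves the interpretability of model dependencies and the accuracy of temporal delay localization.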

Abstract

Understanding directed temporal interactions in multivariate time series is essential for interpreting complex dynamical systems and the predictive models trained on them. We present Causal-INSIGHT, a model-agnostic, post-hoc interpretation framework for extracting model-implied (predictor-dependent), directed, time-lagged influence structure from trained temporal predictors. Rather than inferring causal structure at the level of the data-generating process, Causal-INSIGHT analyzes how a fixed, pre-trained predictor responds to systematic, intervention-inspired input clamping applied at inference time. From these responses, we construct directed temporal influence signals that reflect the dependencies the predictor relies on for prediction, and introduce Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity and structural complexity without requiring ground-truth graph labels. Experiments across synthetic, simulated, and realistic benchmarks show that Causal-INSIGHT generalizes across diverse backbone architectures, maintains competitive structural accuracy, and yields significant improvements in temporal delay localization when applied to existing predictors.

Keywords: causal inference, temporal models, model interpretability, time series analysis, post-hoc interpretation, graph selection, predictive models, intervention analysis
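
The input-clamping idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_predictor`, `clamping_influence`, the clamp value of 0, and the mean-absolute-change score are all assumptions made for illustration, and the Qbic graph-selection step is not shown.

```python
import numpy as np

def toy_predictor(window):
    """Hypothetical fixed predictor: y_t = 0.8 * x0[t-2] + 0.5 * x1[t-1]."""
    # window has shape (lags, variables); the last row is the most recent step.
    return 0.8 * window[-2, 0] + 0.5 * window[-1, 1]

def clamping_influence(predict, windows, clamp_value=0.0):
    """Mean absolute change in the prediction when one (lag, variable) input is clamped."""
    lags, n_vars = windows.shape[1:]
    baseline = np.array([predict(w) for w in windows])
    influence = np.zeros((lags, n_vars))
    for lag in range(lags):
        for var in range(n_vars):
            clamped = windows.copy()
            clamped[:, lag, var] = clamp_value  # intervene on a single input slot
            response = np.array([predict(w) for w in clamped])
            influence[lag, var] = np.mean(np.abs(response - baseline))
    return influence

rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 3, 2))  # 200 windows, 3 lags, 2 variables
influence = clamping_influence(toy_predictor, windows)
print(np.round(influence, 3))
```

Only the two input slots the toy predictor actually reads receive a non-zero score, which is the kind of predictor-dependent, time-lagged signal the abstract refers to.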

246. ❌ Not a fragment, but the whole: Map-based evaluation of data-driven Fire Danger Index models

Authors: Shahbaz Alvi, Italo Epicoco, Jose Maria Costa Saura Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

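Scoring rationale: The paper studies evaluation methods for forest-fire prediction models, an application of machine learning in environmental science. It is somewhat related to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5), but it does not involve large models, new deep-learning techniques, or any of the specific technologies named in the other keywords (LLMs, MoE, Scaling Laws, and so on), so every other keyword is scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new map-based evaluation method for forest-fire danger index forecasting models and shows that an ensemble of machine-learning models improves fire identification while reducing false alarms.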

Abstract (Chinese translation)

越来越多的文献开始关注利用机器学习方法预测野火发生，这些方法利用了高分辨率数据以及传统基于过程的框架通常忽略的火灾预测因子。虽然机器学习分类器的标准评估指标很重要，但对于火灾危险指数（Fire Danger Index, FDI）预报而言，这些指标对模型实际运行性能的衡量可能有限。此外，模型评估往往未能充分考虑误报率，尽管其在业务场景中至关重要。本文重新审视了每日FDI模型的评估范式，并提出了一种与真实世界决策相一致的新型森林火灾预报模型评估方法。我们系统性地评估了模型在准确预测火灾活动和误报（虚假警报）方面的性能。进一步研究表明，机器学习模型的集成能够同时提升火灾识别能力并减少误报。

Abstract

A growing body of literature has focused on predicting wildfire occurrence using machine learning methods, capitalizing on high-resolution data and fire predictors that canonical process-based frameworks largely ignore. Standard evaluation metrics for an ML classifier, while important, provide a potentially limited measure of the model’s operational performance for the Fire Danger Index (FDI) forecast. Furthermore, model evaluation is frequently conducted without adequately accounting for false positive rates, despite their critical relevance in operational contexts. In this paper, we revisit the daily FDI model evaluation paradigm and propose a novel method for evaluating a forest fire forecasting model that is aligned with real-world decision-making. Furthermore, we systematically assess performance in accurately predicting fire activity and the false positives (false alarms). We further demonstrate that an ensemble of ML models improves both fire identification and reduces false positives.

Keywords: wildfire prediction, Fire Danger Index, machine learning, model evaluation, false positives, ensemble models, forest fire forecasting, operational performance
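
To make the abstract's emphasis on false alarms concrete, here is a minimal sketch that scores binary fire/no-fire predictions over map cells and combines models by majority vote. The abstract does not specify the paper's map-based metric, so the probability of detection, the false-alarm ratio, the majority-vote ensemble, and the toy data (constructed so that the models' errors do not overlap) are all assumptions for illustration.

```python
import numpy as np

def detection_and_false_alarms(pred, truth):
    """Return (probability of detection, false-alarm ratio) for binary map cells."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    hits = np.sum(pred & truth)
    misses = np.sum(~pred & truth)
    false_alarms = np.sum(pred & ~truth)
    pod = hits / (hits + misses) if hits + misses else 0.0
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else 0.0
    return pod, far

def majority_vote(preds):
    """Flag a cell as fire when more than half of the models flag it."""
    preds = np.asarray(preds, dtype=bool)
    return preds.sum(axis=0) * 2 > preds.shape[0]

# Toy flattened map of six cells; 1 = fire observed / predicted.
truth = np.array([1, 1, 0, 0, 1, 0])
models = np.array([
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
])
single_scores = [detection_and_false_alarms(m, truth) for m in models]
ensemble_score = detection_and_false_alarms(majority_vote(models), truth)
print(single_scores, ensemble_score)
```

Reporting both numbers side by side keeps a model from looking good on detection alone while raising many false alarms.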

247. ❌ Residual-as-Teacher: Mitigating Bias Propagation in Student–Teacher Estimation

Authors: Kakei Yamamoto, Martin J. Wainwright Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

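Scoring rationale: The paper studies statistical estimation in a student–teacher setting and proposes Residual-as-Teacher (RaT), a method for limiting the propagation of teacher-model bias. Its content centers on statistical learning theory, bias-propagation analysis, and optimization algorithms, which places it in machine-learning theory. Every scoring keyword targets large models, specific deep-learning techniques, or particular AI application areas, and the paper addresses none of these, nor anything specific to AI for Science, so all keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how teacher-model bias propagates in the student–teacher framework and proposes Residual-as-Teacher, which uses the teacher to estimate the residuals of the student's predictions and thereby reduces the effect of that bias. Theoretical analysis and experiments indicate that the method outperforms the standard approach of directly matching the teacher's outputs.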

Abstract (Chinese translation)

我们研究学生-教师框架下的统计估计问题，其中使用预训练教师的预测来指导学生模型。标准方法是训练学生直接匹配教师的输出，我们称之为学生软匹配（SM）。这种方法会直接传播教师模型中存在的任何系统性偏差或设定错误，从而降低学生的预测性能。我们提出并分析了一种替代方案，称为残差即教师（RaT），该方法利用教师来估计学生预测的残差。我们的分析表明，学生通过这种方式可以模拟求解理想优化问题的近端梯度方案，这理论上能够减少教师偏差的影响。针对一般的学生-教师组合，我们建立了任意RaT不动点的非渐近超额风险界，并给出了学生-教师迭代方案的收敛性保证。对于基于核函数的学生-教师组合，我们证明了一个显著分离现象：RaT方法达到了极小极大最优速率，而SM方法在任何样本量下都会产生恒定预测误差。在协变量偏移条件下的合成数据和ImageNette分类实验验证了我们的理论发现。

Abstract

We study statistical estimation in a student–teacher setting, where predictions from a pre-trained teacher are used to guide a student model. A standard approach is to train the student to directly match the teacher’s outputs, which we refer to as student soft matching (SM). This approach directly propagates any systematic bias or mis-specification present in the teacher, thereby degrading the student’s predictions. We propose and analyze an alternative scheme, known as residual-as-teacher (RaT), in which the teacher is used to estimate residuals in the student’s predictions. Our analysis shows how the student can thereby emulate a proximal gradient scheme for solving an oracle optimization problem, and this provably reduces the effect of teacher bias. For general student–teacher pairs, we establish non-asymptotic excess risk bounds for any RaT fixed point, along with convergence guarantees for the student-teacher iterative scheme. For kernel-based student–teacher pairs, we prove a sharp separation: the RaT method achieves the minimax-optimal rate, while the SM method incurs constant prediction error for any sample size. Experiments on both synthetic data and ImageNette classification under covariate shift corroborate our theoretical findings.

Keywords: student-teacher estimation, bias propagation, residual-as-teacher, statistical estimation, covariate shift, kernel methods, excess risk bounds, proximal gradient

248. ❌ The Symmetric Perceptron: a Teacher-Student Scenario

Authors: Giovanni Catania, Aurélien Decelle, Suhanee Korpe Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

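Scoring rationale: The paper studies a teacher-student scenario for the symmetric binary perceptron. It is a theoretical analysis of a classical machine-learning model using statistical-physics methods (annealed and quenched free-entropy calculations), phase-diagram analysis, and Monte Carlo optimization algorithms. All scoring keywords target large-model and deep-learning techniques and their applications, whereas this paper does not involve deep neural networks, Transformers, large language models, or any other deep-learning technique; it is purely a mathematical analysis of the classical perceptron model. All keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the symmetric binary perceptron in a teacher-student framework, uses free-entropy calculations in the high-dimensional limit to map the phase diagram, identifies a learning scenario organized by a second-order instability followed by a first-order transition, and analyzes how the choice of potential and the temperature affect Monte Carlo-based optimization algorithms.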

Abstract (Chinese translation)

我们提出并求解了对称二元感知机的师生问题表述，将传统上以存储为导向的模型转化为一种在任意样本密度下均存在保证解的植入式推断问题。我们改进了传统上仅考虑U型势或矩形势的对称感知机表述，通过在两个区域中均引入标签来实现这一目标。基于此表述，我们分析了无噪声示例下的贝叶斯最优机制，以及两种不同势函数/分类规则下热噪声的影响。通过在高维极限下进行退火与淬火自由熵计算，我们在三个控制参数——样本密度$α$、原点与其中一个对称超平面之间的距离$κ$以及温度$T$——所构成的相图中描绘了相变结构，并识别出一种稳健的学习情景：该情景以二阶不稳定性为组织机制，该不稳定性首先产生与教师相关的次优状态，随后通过一阶相变达到完全对齐。我们揭示了这一结构如何依赖于势函数的选择，以及次优解的亚稳态与其向植入构型融解之间的相互作用，这对于基于蒙特卡洛的优化算法具有重要意义。

Abstract

We introduce and solve a teacher-student formulation of the symmetric binary Perceptron, turning a traditionally storage-oriented model into a planted inference problem with a guaranteed solution at any sample density. We adapt the formulation of the symmetric Perceptron, which traditionally considers either the u-shaped potential or the rectangular one, by including labels in both regions. With this formulation, we analyze both the Bayes-optimal regime for noise-less examples and the effect of thermal noise under two different potential/classification rules. Using annealed and quenched free-entropy calculations in the high-dimensional limit, we map the phase diagram in the three control parameters, namely the sample density $α$, the distance between the origin and one of the symmetric hyperplanes $κ$ and temperature $T$, and identify a robust scenario where learning is organized by a second-order instability that creates teacher-correlated suboptimal states, followed by a first-order transition to full alignment. We show how this structure depends on the choice of potential, the interplay between metastability of the suboptimal solution and its melting towards the planted configuration, which is relevant for Monte Carlo-based optimization algorithms.

Keywords: symmetric Perceptron, teacher-student scenario, phase diagram, free-entropy calculations, high-dimensional limit, Monte Carlo optimization, second-order instability, first-order transition

249. ❌ Hessian-informed machine learning interatomic potential towards bridging theory and experiments

Authors: Bangchen Yin, Jian Ouyang, Zhen Fan, Kailai Lin, Hanshi Hu, Dingshun Lv, Weiluo Ren, Hai Xiao, Ji Chen, Changsu Cao Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

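Scoring rationale: The paper develops a machine-learning interatomic potential (Hi-MLIP) for molecular and materials simulation. Its core contribution is an efficient Hessian-supervised training protocol (HINT) that captures the curvature of the potential energy surface accurately and thereby improves transition-state search and Gibbs free-energy prediction. The topic belongs to scientific computing and computational chemistry, specifically machine learning applied to physical simulation. Every scoring keyword except the last targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper involves no language model, text generation, or natural-language processing. "AI for Science OR Bioinformatics OR Cheminformatics" therefore receives 10 points (highly relevant, core content), because it directly covers AI applications in science, including computational chemistry and materials science; all other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The study develops a Hessian-informed machine-learning interatomic potential (Hi-MLIP) and an efficient training protocol, HINT, that capture the curvature of the potential energy surface. This significantly improves transition-state search and Gibbs free-energy prediction, brings results close to chemical accuracy in data-scarce regimes, and enables accurate simulation of strongly anharmonic hydrides.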

Abstract (Chinese translation)

势能面的局域曲率对于从第一性原理预测分子和材料的实验观测值至关重要，但对于复杂体系而言，其计算仍遥不可及。本研究提出了一种海森矩阵（Hessian）信息增强的机器学习原子间势（Hi-MLIP），能够可靠地捕捉此类曲率，从而实现对相关热力学与动力学现象的精确分析。为使海森矩阵监督在实际中可行，我们开发了一种高效训练方案，称为海森矩阵信息训练（HINT），将昂贵海森矩阵标签的需求降低了二至四个数量级。HINT整合了多项关键技术，包括海森矩阵预训练、构型采样、课程学习以及随机投影海森矩阵损失函数。在HINT的支持下，Hi-MLIP显著改进了过渡态搜索，并将吉布斯自由能预测的精度提升至接近化学精度，尤其在数据稀缺的情况下表现突出。我们的框架还能精确处理强非简谐氢化物，重现声子重整化效应并预测与实验高度吻合的超导临界温度，同时规避了非简谐计算的计算瓶颈。这些成果为增强机器学习原子间势的曲率感知能力开辟了一条实用路径，在广泛体系范围内架起了模拟与实验观测之间的桥梁。

Abstract

Local curvature of potential energy surfaces is critical for predicting certain experimental observables of molecules and materials from first principles, yet it remains far beyond reach for complex systems. In this work, we introduce a Hessian-informed Machine Learning Interatomic Potential (Hi-MLIP) that captures such curvature reliably, thereby enabling accurate analysis of associated thermodynamic and kinetic phenomena. To make Hessian supervision practically viable, we develop a highly efficient training protocol, termed Hessian INformed Training (HINT), achieving two to four orders of magnitude reduction for the requirement of expensive Hessian labels. HINT integrates critical techniques, including Hessian pre-training, configuration sampling, curriculum learning and stochastic projection Hessian loss. Enabled by HINT, Hi-MLIP significantly improves transition-state search and brings Gibbs free-energy predictions close to chemical accuracy especially in data-scarce regimes. Our framework also enables accurate treatment of strongly anharmonic hydrides, reproducing phonon renormalization and superconducting critical temperatures in close agreement with experiment while bypassing the computational bottleneck of anharmonic calculations. These results establish a practical route to enhancing curvature awareness of machine learning interatomic potentials, bridging simulation and experimental observables across a wide range of systems.

Keywords: Machine Learning Interatomic Potential, Hessian-informed, Potential Energy Surfaces, Transition-state Search, Gibbs Free-energy, Anharmonic Hydrides, Computational Chemistry, HINT Training Protocol
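
The abstract names a "stochastic projection Hessian loss" without giving its form. One plausible reading, sketched below purely as an assumption, is to compare Hessian-vector products of the model and the reference along a few random directions instead of matching the full Hessian. The function names, the finite-difference Hessian-vector product, and the toy quadratic energies are hypothetical and not taken from the paper.

```python
import numpy as np

def hessian_vector_product(grad_fn, x, v, eps=1e-4):
    """Central finite-difference approximation of H(x) @ v from a gradient function."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

def projected_hessian_loss(grad_fn, x, reference_hessian, n_probes=8, seed=0):
    """Mean squared mismatch between model and reference Hessians along random unit vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.normal(size=x.shape)
        v /= np.linalg.norm(v)  # random probe direction
        diff = hessian_vector_product(grad_fn, x, v) - reference_hessian @ v
        total += float(diff @ diff)
    return total / n_probes

# Toy quadratic energies E(x) = 0.5 * x^T A x, whose gradient is A @ x and Hessian is A.
A_ref = np.array([[2.0, 0.5], [0.5, 1.0]])
A_model = np.array([[2.0, 0.0], [0.0, 1.0]])  # model missing the off-diagonal coupling
x0 = np.array([0.3, -0.2])

loss_exact = projected_hessian_loss(lambda x: A_ref @ x, x0, A_ref)
loss_off = projected_hessian_loss(lambda x: A_model @ x, x0, A_ref)
print(loss_exact, loss_off)
```

Each probe needs only two gradient evaluations, so the cost does not grow with the number of Hessian entries, which is the usual motivation for projection-based curvature losses.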

250. ❌ From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

Authors: Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

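Scoring rationale: The paper's core topic is the evaluation of deep research agents (DRAs), which falls within agent research for large-model applications. Highly relevant keywords: LLM Agents (the core object of study), Chain of Thought and System 2 Thinking (reasoning ability), Hallucination Mitigation (factuality verification), and Mechanistic Interpretability (an interpretable evaluation framework). Moderately relevant keywords: Large Language Models (agents are typically built on LLMs), Retrieval-Augmented Generation (information retrieval), Self-Correction (verification processes), and Tool Use (agents may use tools). The remaining keywords have no direct connection to the paper's specific technical content, such as its category-theoretic framework and structural evaluation benchmark.

!!! tip deepseek-chat TL;DR

Addressing the lack of systematic evaluation methods for deep research agents, the paper proposes a category-theory-based structural evaluation framework and benchmark. It finds that current leading models still show significant deficits in synthesizing complex structured information, with the best model reaching only 19.9% average accuracy.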

Abstract (Chinese translation)

尽管深度研究智能体（Deep Research Agents, DRAs）已成为复杂信息合成的一种前景广阔的范式，但其评估仍受限于临时性的经验基准。这些启发式方法既未严格建模智能体行为，也未充分对长程合成与歧义消解进行压力测试。为弥补这一不足，我们通过范畴论视角对DRA行为进行形式化，将深度研究工作流建模为一系列结构保持映射（函子）的组合。基于此理论框架，我们引入了一个新颖的机制感知基准，包含296个问题，旨在沿四个可解释的维度对智能体进行压力测试：遍历序列连通链、验证V结构拉回中的交集、对检索到的子结构施加拓扑排序，以及通过米田探针执行本体证伪。我们对11个领先模型进行的严格评估确立了一个持续低迷的基线水平，最先进模型的平均准确率仅为19.9%，揭示了形式化结构压力测试的难度。此外，我们的研究结果揭示了当前人工智能能力存在鲜明二分性：尽管先进的深度研究流程能成功重新定义动态拓扑重排序，并展现出稳健的本体验证能力——在证伪幻觉前提方面匹敌纯推理模型——但它们几乎普遍在多跳结构合成任务上失效。关键的是，跨任务的巨大性能差异暴露了当前系统仍依赖于脆弱的启发式方法，而非系统性的理解。最终，本研究表明，虽然顶级自主智能体现已能够有机整合搜索与推理，但实现对复杂结构信息的泛化掌握，仍然是一个艰巨的开放挑战。

Abstract

Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification – matching pure reasoning models in falsifying hallucinated premises – they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at https://github.com/tzq1999/CDR.

Keywords: deep research agents, structural evaluation, category theory, benchmark, reasoning, hallucination, autonomous agents, information synthesis

251. ❌ Practical Efficient Global Optimization is No-regret

Authors: Jingyi Wang, Haowei Wang, Nai-Yuan Chiang, Juliane Mueller, Tucker Hartland, Cosmin G. Petra Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25311v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

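Scoring rationale: The paper studies the Efficient Global Optimization (EGO) algorithm, which belongs to Bayesian optimization and mainly involves Gaussian processes, the expected improvement function, and numerical-stability analysis. The content focuses entirely on the theory of a classical optimization algorithm and does not involve large language models, deep learning, AI for Science, or related techniques. All keywords revolve around large models and deep-learning technology and have no connection to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper establishes the first cumulative regret upper bound for EGO as used in practice, that is, with a positive nugget added, and proves that the bound is sublinear for commonly used kernels, which makes practical EGO a no-regret algorithm.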

Abstract (Chinese translation)

高效全局优化（Efficient Global Optimization，简称EGO）是最广泛使用的无噪声贝叶斯优化算法之一。其核心由高斯过程（Gaussian Process，简称GP）代理模型与期望提升（Expected Improvement，简称EI）采集函数构成。在实际应用中，当使用EGO时，通常会在确定性高斯过程的协方差矩阵中添加一个小的正数值标量矩阵（亦称为“块金”或“抖动”），以提升数值稳定性。我们将这种添加了正块金的EGO称为实用EGO。尽管该方法已被广泛采用并取得了实证上的成功，但迄今为止，实用EGO的累积遗憾界尚未得到严格建立。本文首次给出了实用EGO的累积遗憾上界。具体而言，我们证明了实用EGO具有次线性的累积遗憾界，因此对于常用核函数（包括平方指数核与Matérn核（$ν>\frac{1}{2}$）），它是一种无遗憾算法。此外，我们分析了块金值对遗憾界的影响，并讨论了其在选择上的理论意义。数值实验被用于支持和验证我们的研究结果。

Abstract

Efficient global optimization (EGO) is one of the most widely used noise-free Bayesian optimization algorithms. It comprises the Gaussian process (GP) surrogate model and expected improvement (EI) acquisition function. In practice, when EGO is applied, a scalar matrix of a small positive value (also called a nugget or jitter) is usually added to the covariance matrix of the deterministic GP to improve numerical stability. We refer to this EGO with a positive nugget as the practical EGO. Despite its wide adoption and empirical success, to date, cumulative regret bounds for practical EGO have yet to be established. In this paper, we present for the first time the cumulative regret upper bound of practical EGO. In particular, we show that practical EGO has sublinear cumulative regret bounds and thus is a no-regret algorithm for commonly used kernels including the squared exponential (SE) and Matérn kernels ($ν>\frac{1}{2}$). Moreover, we analyze the effect of the nugget on the regret bound and discuss the theoretical implication on its choice. Numerical experiments are conducted to support and validate our findings.

Keywords: Efficient Global Optimization, Bayesian Optimization, Gaussian Process, Expected Improvement, Cumulative Regret, No-regret Algorithm, Nugget, Numerical Stability
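
The "practical EGO" setup described in the abstract, a noise-free GP whose covariance matrix gets a small positive nugget plus an EI acquisition function, can be sketched in a few lines of NumPy. This is a minimal illustration for a minimization problem, not the paper's code: the squared exponential kernel with length scale 0.3, the nugget value `1e-6`, the test objective, and all function names are assumptions.

```python
import math
import numpy as np

def se_kernel(a, b, length_scale=0.3):
    """Squared exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, nugget=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP with a nugget on the diagonal."""
    K = se_kernel(x_train, x_train) + nugget * np.eye(len(x_train))  # the "practical" jitter
    K_s = se_kernel(x_train, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 0.0, None)  # prior variance is 1 for this kernel
    return mean, np.sqrt(var)

def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value, for minimization."""
    ei = np.zeros_like(mean)
    for i, (m, s) in enumerate(zip(mean, std)):
        if s > 0:
            z = (best - m) / s
            cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
            pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
            ei[i] = max((best - m) * cdf + s * pdf, 0.0)  # guard against round-off
        else:
            ei[i] = max(best - m, 0.0)
    return ei

def objective(x):
    return np.sin(3 * x) + x ** 2

x_train = np.array([-0.9, -0.2, 0.4, 1.0])
y_train = objective(x_train)
x_query = np.linspace(-1.0, 1.0, 201)
mean, std = gp_posterior(x_train, y_train, x_query)
ei = expected_improvement(mean, std, y_train.min())
x_next = x_query[np.argmax(ei)]  # next point an EGO iteration would evaluate
print(x_next)
```

Without the nugget, the Cholesky factorization can fail once sample points cluster together; with it, the posterior no longer interpolates the data exactly, which is the gap between theory and practice that the paper's regret analysis addresses.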

252. ❌ Mitigating Evasion Attacks in Fog Computing Resource Provisioning Through Proactive Hardening

Authors: Younes Salmi, Hanna Bogucka Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

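Scoring rationale: The paper studies the vulnerability of the k-means algorithm used for fog-computing resource provisioning to model-integrity attacks, together with a defense. It relies on classical machine learning (k-means clustering) and adversarial training, and does not involve large models, deep learning, AI for Science, or any technical concept named in the scoring keywords, so all keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the vulnerability of the k-means algorithm used for fog-computing resource provisioning to model-integrity attacks and proposes a proactive defense based on adversarial training that effectively maintains system stability.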

Abstract (Chinese translation)

本文研究了雾网络中用于资源供应的k-means算法所分配的虚拟机对模型完整性攻击的脆弱性。所考察的k-means算法迭代运行两个阶段：离线聚类以形成请求工作负载的集群，以及在线阶段将新到达的请求分类到离线创建的集群中。首先，我们考虑针对在线阶段分类器的逃避攻击。攻击者利用基于查询的反向工程发起探索性攻击，以发现机器学习（ML）模型（即聚类方案）。随后，在离线阶段触发被动诱发性（逃避）攻击。为保护模型，我们提出一种主动防御方法，利用对抗训练为分类器引入攻击鲁棒性。实验结果表明，我们的缓解技术能有效维持资源供应系统在遭受攻击时的稳定性。

Abstract

This paper investigates the susceptibility to model integrity attacks that overload virtual machines assigned by the k-means algorithm used for resource provisioning in fog networks. The considered k-means algorithm runs two phases iteratively: offline clustering to form clusters of requested workload and online classification of new incoming requests into offline-created clusters. First, we consider an evasion attack against the classifier in the online phase. A threat actor launches an exploratory attack using query-based reverse engineering to discover the Machine Learning (ML) model (the clustering scheme). Then, a passive causative (evasion) attack is triggered in the offline phase. To defend the model, we suggest a proactive method using adversarial training to introduce attack robustness into the classifier. Our results show that our mitigation technique effectively maintains the stability of the resource provisioning system against attacks.

Keywords: fog computing, resource provisioning, k-means algorithm, evasion attacks, adversarial training, model integrity, machine learning, cybersecurity
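
The two-phase k-means scheme described in the abstract, offline clustering of requested workloads followed by online assignment of new requests to the existing clusters, can be sketched as follows. The two-dimensional workload features, the cluster count, and the function names are assumptions for illustration; the attack and the adversarial-training defense are not modeled here.

```python
import numpy as np

def assign(points, centroids):
    """Online phase: index of the nearest centroid for each request."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def offline_kmeans(points, k, n_iter=20, seed=0):
    """Offline phase: plain Lloyd iterations starting from k randomly chosen points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

# Hypothetical workload features, e.g. (CPU demand, memory demand) per request.
rng = np.random.default_rng(1)
light = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(50, 2))
heavy = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
workloads = np.vstack([light, heavy])

centroids = offline_kmeans(workloads, k=2)
new_requests = np.array([[1.1, 0.9], [4.8, 5.2]])
online_labels = assign(new_requests, centroids)
print(centroids, online_labels)
```

Because the online phase is a nearest-centroid rule, an attacker who learns the centroids through queries can craft requests that land in the wrong cluster, which is the evasion setting the paper considers.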

253. ❌ Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem

Authors: Hironori Ohigashi, Shinichiro Hamada Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25241v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是将离线强化学习（Decision Transformer）应用于组合优化问题（旅行商问题），使用指针网络和期望回归等技术来超越传统启发式算法。所有关键词都专注于大语言模型（LLM）及相关技术（如MoE、RLHF、RAG、量化等），而本文完全不涉及语言模型或自然语言处理，而是纯粹的强化学习在组合优化领域的应用。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

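The paper proposes a neural combinatorial optimization method based on offline reinforcement learning (Decision Transformer) for the Traveling Salesman Problem; experiments show that it outperforms the four classical heuristics in its training data.

Abstract translation (Chinese)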

组合优化问题（如旅行商问题）在工业领域至关重要，但属于NP难问题。神经组合优化方法已展现出潜力，但其对在线强化学习的依赖限制了实际部署，且未能充分利用数十年的算法知识积累。为应对这些局限，我们应用离线强化学习框架——决策变换器，直接从启发式解的数据集中学习更优策略；该方法不仅旨在模仿现有解，更致力于综合并超越它们。具体而言，我们（i）集成指针网络以处理节点选择中依赖实例、可变的动作空间；（ii）采用期望分位数回归对“收益目标”进行乐观条件化处理，这对最优值差异巨大的问题实例至关重要。实验表明，我们的方法在训练所基于的四种经典启发式算法基础上，持续生成更高质量的路径，这证明了离线强化学习在挖掘并超越现有领域知识中隐含性能方面的潜力。

Abstract

Combinatorial optimization problems like the Traveling Salesman Problem are critical in industry yet NP-hard. Neural Combinatorial Optimization has shown promise, but its reliance on online reinforcement learning (RL) hampers deployment and underutilizes decades of algorithmic knowledge. We address these limitations by applying the offline RL framework, Decision Transformer, to learn superior strategies directly from datasets of heuristic solutions; it aims not only to imitate but to synthesize and outperform them. Concretely, we (i) integrate a Pointer Network to handle the instance-dependent, variable action space of node selection, and (ii) employ expectile regression for optimistic conditioning of Return-to-Go, which is crucial for instances with widely varying optimal values. Experiments show that our method consistently produces higher-quality tours than the four classical heuristics it is trained on, demonstrating the potential of offline RL to unlock and exceed the performance embedded in existing domain knowledge.

Keywords: Offline Reinforcement Learning, Decision Transformer, Neural Combinatorial Optimization, Traveling Salesman Problem, Pointer Network, Expectile Regression, Heuristic Solutions, Return-to-Go
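
The expectile regression named in the abstract can be illustrated with a minimal sketch. It assumes the standard asymmetric squared loss; the function names, the sample values, and `tau=0.9` are illustrative and not taken from the paper, and how the authors connect this to Return-to-Go conditioning is not shown here.

```python
def expectile_loss(residuals, tau=0.9):
    """Asymmetric squared loss: weight tau on non-negative residuals, 1 - tau on negative ones."""
    total = 0.0
    for u in residuals:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(residuals)


def fit_expectile(values, tau=0.9, steps=200):
    """Estimate the tau-expectile of a sample by iteratively reweighted averaging."""
    m = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(steps):
        weights = [tau if v >= m else 1.0 - tau for v in values]
        m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return m


# Hypothetical returns observed in a dataset of heuristic solutions.
returns = [1.0, 2.0, 3.0, 4.0, 10.0]
mean_like = fit_expectile(returns, tau=0.5)
optimistic = fit_expectile(returns, tau=0.9)
print(mean_like, optimistic)
```

With `tau` above 0.5 the estimate sits above the sample mean, which is one way an optimistic return target could be formed from mixed-quality data.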

254. ❌ Gap Safe Screening Rules for Fast Training of Robust Support Vector Machines under Feature Noise

Authors: Tan-Hau Nguyen, Thu-Le Tran, Kien Trung Nguyen Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

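Scoring rationale: The paper studies how to accelerate the training of robust support vector machines (R-SVMs), which belongs to classical machine learning optimization. It does not touch on large language models, deep learning, AI for Science, or any other technique, method, or application covered by the keywords. All keywords therefore receive a relevance of 0.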

!!! tip deepseek-chat TL;DR

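To address the high computational cost of training robust support vector machines under feature noise, the paper proposes GAP-based safe sample screening rules that significantly reduce training time while preserving classification accuracy.

Abstract translation (Chinese)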

鲁棒支持向量机（R-SVMs）通过采用最坏情况鲁棒优化框架，将特征噪声的不确定性集合显式纳入训练过程，从而有效应对特征噪声问题。尽管这种鲁棒性提升了模型的可靠性，但也带来了计算成本的显著增加。本研究针对R-SVMs开发了安全样本筛选规则，在不影响最优解的前提下降低训练复杂度。据我们所知，这是首次将安全筛选技术应用于监督机器学习中的最坏情况鲁棒模型。该方法能够安全识别出那些不确定性集合完全位于间隔超平面某一侧的训练样本，从而缩减问题规模并加速优化过程。由于R-SVMs的非标准结构，所提出的筛选规则基于拉格朗日对偶理论推导，而非近期方法中常用的Fenchel-Rockafellar对偶框架。基于此分析，我们首先建立了理想筛选规则，随后通过将基于间隙（GAP）的安全区域适配至鲁棒设定，推导出实用规则。实验表明，所提方法在保持分类精度的同时，显著缩短了训练时间。

Abstract

Robust Support Vector Machines (R-SVMs) address feature noise by adopting a worst-case robust formulation that explicitly incorporates uncertainty sets into training. While this robustness improves reliability, it also leads to increased computational cost. In this work, we develop safe sample screening rules for R-SVMs that reduce the training complexity without affecting the optimal solution. To the best of our knowledge, this is the first study to apply safe screening techniques to worst-case robust models in supervised machine learning. Our approach safely identifies training samples whose uncertainty sets are guaranteed to lie entirely on either side of the margin hyperplane, thereby reducing the problem size and accelerating optimization. Owing to the nonstandard structure of R-SVMs, the proposed screening rules are derived from the Lagrangian duality rather than the Fenchel-Rockafellar duality commonly used in recent methods. Based on this analysis, we first establish an ideal screening rule, and then derive a practical rule by adapting GAP-based safe regions to the robust setting. Experiments demonstrate that the proposed method significantly reduces training time while preserving classification accuracy.

Keywords: Robust Support Vector Machines, Feature Noise, Safe Screening Rules, Training Acceleration, Lagrangian Duality, Worst-case Robust Models, GAP-based Safe Regions, Computational Complexity

255. ❌ Fair regression under localized demographic parity constraints

Authors: Arthur Charpentier, Christophe Denis, Romuald Elie, Mohamed Hebiri, François HU Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

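Scoring rationale: The paper studies fairness constraints (demographic parity) in regression, a topic in machine learning fairness that is unrelated to all of the scoring keywords, which concern the principles, applications, or specific techniques of large models and deep learning. It does not involve large models, deep learning techniques, AI for Science applications, or related methods.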

!!! tip deepseek-chat TL;DR

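Because full-distribution demographic parity is overly restrictive in regression, the paper proposes a relaxation that enforces parity only at a finite set of quantile levels or thresholds, and develops a post-processing algorithm with theoretical guarantees that yields an interpretable fairness-accuracy trade-off.

Abstract translation (Chinese)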

人口统计均等（Demographic Parity，DP）是一种广泛使用的群体公平性准则，要求预测分布在敏感群体间保持不变。虽然在分类任务中这一准则较为自然，但在回归任务中完全分布层面的DP往往限制过强，并可能导致显著的准确性损失。我们提出了一种针对回归任务的DP松弛方法，仅在有限的分位数水平和/或分数阈值上强制实现均等。具体而言，我们引入了一种新颖的 $(\ell, Z)$-公平预测器，它对指定配对 $(\ell_m, z_m)$ 施加形式为 $F_{f \mid S=s}(z_m) = \ell_m$ 的群体累积分布函数约束。针对此设定，我们通过拉格朗日对偶公式推导出最优公平离散化预测器的闭式表征，并量化了离散化代价，证明随着网格细化，其与连续最优解的风险差距趋近于零。我们进一步开发了一种基于双样本（标记样本用于学习基础回归器，未标记样本用于校准）的模型无关后处理算法，并在约束违反和超额惩罚风险方面建立了有限样本保证。此外，我们引入了两种替代框架，在选定的分数阈值上匹配群体与边际累积分布函数值。在这两种设定下，我们均给出了最优公平离散化预测器的闭式解。在合成与真实数据集上的实验展示了可解释的公平性-准确性权衡，能够在保持预测性能的同时，针对决策相关的分位数或阈值进行定向修正。

Abstract

Demographic parity (DP) is a widely used group fairness criterion requiring predictive distributions to be invariant across sensitive groups. While natural in classification, full distributional DP is often overly restrictive in regression and can lead to substantial accuracy loss. We propose a relaxation of DP tailored to regression, enforcing parity only at a finite set of quantile levels and/or score thresholds. Concretely, we introduce a novel $(\ell, Z)$-fair predictor, which imposes groupwise CDF constraints of the form $F_{f \mid S=s}(z_m) = \ell_m$ for prescribed pairs $(\ell_m, z_m)$. For this setting, we derive closed-form characterizations of the optimal fair discretized predictor via a Lagrangian dual formulation and quantify the discretization cost, showing that the risk gap to the continuous optimum vanishes as the grid is refined. We further develop a model-agnostic post-processing algorithm based on two samples (labeled for learning a base regressor and unlabeled for calibration), and establish finite-sample guarantees on constraint violation and excess penalized risk. In addition, we introduce two alternative frameworks where we match group and marginal CDF values at selected score thresholds. In both settings, we provide closed-form solutions for the optimal fair discretized predictor. Experiments on synthetic and real datasets illustrate an interpretable fairness-accuracy trade-off, enabling targeted corrections at decision-relevant quantiles or thresholds while preserving predictive performance.

Keywords: fair regression, demographic parity, quantile constraints, post-processing algorithm, fairness-accuracy trade-off, CDF constraints, group fairness, regression fairness
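
The groupwise CDF constraint in the abstract asks that, for each sensitive group, the share of predictions at or below a threshold matches a prescribed level. A minimal sketch of checking that condition on empirical data follows; the group names, prediction values, and the single (level, threshold) pair are hypothetical, and this only measures constraint gaps rather than reproducing the paper's post-processing algorithm.

```python
def empirical_cdf(scores, threshold):
    """Fraction of scores at or below the threshold."""
    return sum(1 for s in scores if s <= threshold) / len(scores)


def constraint_gaps(scores_by_group, targets):
    """For each group, return |F_group(z_m) - ell_m| for every (ell_m, z_m) pair."""
    return {
        group: [abs(empirical_cdf(scores, z) - ell) for ell, z in targets]
        for group, scores in scores_by_group.items()
    }


# Hypothetical predictions for two sensitive groups.
predictions = {
    "a": [0.1, 0.2, 0.3, 0.4],
    "b": [0.3, 0.4, 0.5, 0.6],
}
# One prescribed pair (ell_m, z_m): half of each group should score <= 0.25.
targets = [(0.5, 0.25)]
gaps = constraint_gaps(predictions, targets)
print(gaps)
```

A gap of zero for every group and every pair would mean the relaxed parity condition holds exactly on this sample; nonzero gaps show where a correction would be targeted.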

256. ❌ A CDF-First Framework for Free-Form Density Estimation

Authors: Chenglong Song, Mazharul Islam, Lin Wang, Bing Chen, Bo Yang Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

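Scoring rationale: The paper focuses on machine learning methods for conditional density estimation (CDE), proposing a framework built on the cumulative distribution function (CDF) for free-form density estimation; it covers probabilistic modeling, neural network parameterization, and experimental validation. The scoring keywords all relate directly to large models, deep learning principles, or specific AI applications such as AI for Science, whereas this paper belongs to classical probability density estimation. It does not address large models, LLMs, MoE, scaling laws, training techniques, inference optimization, agents, or quantization, nor is it applied to scientific fields such as bioinformatics or cheminformatics, so all keywords receive a relevance of 0.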

!!! tip deepseek-chat TL;DR

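The paper proposes a CDF-based framework for free-form conditional density estimation; parameterizing a smooth CDF guarantees a valid probability density function, and the method outperforms existing approaches in experiments.

Abstract translation (Chinese)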

条件密度估计（Conditional Density Estimation, CDE）是机器学习中的一项基础任务，其目标是对完整的条件分布律 $\mathbb{P}(\mathbf{y} \mid \mathbf{x})$ 进行建模，而不仅仅是点预测（例如均值、众数）。一个核心挑战在于自由形式的密度估计，即在不施加限制性假设的前提下，捕捉那些呈现多模态、非对称性或拓扑复杂性的分布。然而，主流方法通常直接估计概率密度函数（Probability Density Function, PDF），这在数学上是不适定的：对经验分布进行微分会放大有限数据集中固有的随机波动，从而需要引入强归纳偏置，这些偏置会限制模型的表达能力，并在假设不成立时失效。我们提出了一个“CDF优先”的框架，通过估计一个稳定且适定的目标——累积分布函数（Cumulative Distribution Function, CDF）来规避此问题，然后通过对学习到的平滑CDF进行微分来恢复PDF。我们使用平滑最小-最大（Smooth Min-Max, SMM）网络对CDF进行参数化，该框架在结构上保证了有效的PDF，支持易于处理的近似似然训练，并能保持复杂的分布形状。对于多变量输出，我们采用带有SMM因子的自回归分解方法。实验表明，在一系列单变量和多变量任务上，我们的方法优于当前最先进的密度估计器。

Abstract

Conditional density estimation (CDE) is a fundamental task in machine learning that aims to model the full conditional law $\mathbb{P}(\mathbf{y} \mid \mathbf{x})$, beyond mere point prediction (e.g., mean, mode). A core challenge is free-form density estimation, capturing distributions that exhibit multimodality, asymmetry, or topological complexity without restrictive assumptions. However, prevailing methods typically estimate the probability density function (PDF) directly, which is mathematically ill-posed: differentiating the empirical distribution amplifies random fluctuations inherent in finite datasets, necessitating strong inductive biases that limit expressivity and fail when violated. We propose a CDF-first framework that circumvents this issue by estimating the cumulative distribution function (CDF), a stable and well-posed target, and then recovering the PDF via differentiation of the learned smooth CDF. Parameterizing the CDF with a Smooth Min-Max (SMM) network, our framework guarantees valid PDFs by construction, enables tractable approximate likelihood training, and preserves complex distributional shapes. For multivariate outputs, we use an autoregressive decomposition with SMM factors. Experiments demonstrate our approach outperforms state-of-the-art density estimators on a range of univariate and multivariate tasks.

Keywords: Conditional Density Estimation, CDF-first Framework, Free-form Density Estimation, Smooth Min-Max Network, Autoregressive Decomposition, Probability Density Function, Multimodal Distributions, Machine Learning
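
The CDF-first idea in the abstract, modeling a smooth monotone CDF and differentiating it to obtain the density, can be sketched in a few lines. The two-component logistic mixture below is a stand-in chosen for illustration; it is not the Smooth Min-Max network from the paper, and the central-difference step `h` is an arbitrary choice.

```python
import math


def logistic_cdf(y, loc, scale):
    """CDF of a logistic distribution with the given location and scale."""
    return 1.0 / (1.0 + math.exp(-(y - loc) / scale))


def mixture_cdf(y):
    """Smooth, monotone CDF of a bimodal two-component logistic mixture."""
    return 0.5 * logistic_cdf(y, -2.0, 0.5) + 0.5 * logistic_cdf(y, 2.0, 0.5)


def pdf_from_cdf(cdf, y, h=1e-5):
    """Recover the density as a central-difference derivative of the CDF."""
    return (cdf(y + h) - cdf(y - h)) / (2.0 * h)


# Evaluate the recovered density on a grid from -6 to 6.
grid = [i / 10.0 for i in range(-60, 61)]
density = [pdf_from_cdf(mixture_cdf, y) for y in grid]
print(max(density))
```

Because the CDF is monotone and bounded between 0 and 1, the recovered density is non-negative and cannot integrate to more than 1, while the two modes of the mixture are preserved.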

257. ❌ Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Authors: Adam Jakobsen, Sushant Gautam, Hugo Lewi Hammer, Susanne Olofsdotter, Miriam S Johanson, Pål Halvorsen, Vajira Thambawita Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

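Scoring rationale: The core of the paper is using large language models (LLMs) with Retrieval-Augmented Generation (RAG) to generate privacy-preserving synthetic data in a medical science domain (psychiatry). It is therefore highly relevant to "Large Language Models", "Retrieval-Augmented Generation", and "AI for Science" (10 points each). Other keywords such as MoE, quantization, inference acceleration, and alignment are either not covered or mentioned only as background, so they receive 0.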
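
The arithmetic behind the scores in this report can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration section; the rule that a keyword's score equals weight times relevance is an assumption, since the report does not state how the score column is derived.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration


def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as given in the configuration: base + factor * total weight."""
    return base + factor * total_weight


def keyword_score(weight, relevance):
    """ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows):
    """Sum keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


threshold = pass_mark(27.0)
# Three keywords rated 10/10, first with the weight 0.0 listed in the table,
# then with the weight 1.0 listed in the configuration.
rows_as_listed = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
rows_with_unit_weight = [(1.0, 10.0), (1.0, 10.0), (1.0, 10.0)]
print(threshold, total_score(rows_as_listed), total_score(rows_with_unit_weight))
```

Under this assumed rule, a listed weight of 0.0 yields a score of 0.0 regardless of relevance, which is consistent with the 0.0 totals shown in the tables above.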

!!! tip deepseek-chat TL;DR

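The study proposes a zero-shot framework based on large language models and Retrieval-Augmented Generation for generating privacy-preserving synthetic psychiatric data; when real data is unavailable or cannot be shared, its generation quality is competitive with state-of-the-art models that rely on real data, such as CTGAN.

Abstract translation (Chinese)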

医疗健康研究领域的人工智能系统已展现出提升患者诊疗效率与辅助临床工作的潜力，但其发展受限于真实患者数据的有限获取。为解决这一问题，我们提出一种面向精神科表格数据的零样本知识引导框架，该框架通过检索增强生成技术，利用《精神障碍诊断与统计手册（第五版）》（DSM-5）和《国际疾病分类（第十版）》（ICD-10）指导大语言模型（LLMs）进行数据生成。我们采用不同知识库组合进行实验，以生成具有隐私保护功能的合成数据。所构建的模型与两种当前先进的表格数据合成深度学习模型——CTGAN和TVAE进行了性能对比，后两者均依赖真实数据训练，因而存在潜在的隐私风险。评估针对六种焦虑相关障碍展开：特定恐惧症、社交焦虑障碍、场所恐惧症、广泛性焦虑障碍、分离焦虑障碍和惊恐障碍。实验结果显示，CTGAN通常在单变量分布与多变量结构上表现最优，而知识增强的LLM在成对变量结构上具有竞争力，并在分离焦虑与社交焦虑数据上实现了最低的成对误差。消融研究表明，引入临床检索机制相比无检索的LLM持续提升了单变量与成对变量的数据保真度。隐私分析表明，不依赖真实数据的LLM生成数据与原始数据重叠度有限，其平均链接风险与CTGAN相当且处于较低水平；而TVAE尽管k-映射得分较低，却显示出大量数据复制现象。总体而言，当真实数据集无法获取或共享时，基于临床知识锚定的LLM能够生成高质量且具有隐私保护功能的精神科合成数据。

Abstract

AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.

Keywords: Large Language Models, Retrieval-Augmented Generation, Synthetic Data Generation, Psychiatric Data, Privacy Preservation, Zero-shot Learning, Healthcare AI, Clinical Knowledge

258. ❌ Process-Aware AI for Rainfall-Runoff Modeling: A Mass-Conserving Neural Framework with Hydrological Process Constraints

Authors: Mohammad A. Farmani, Hoshin V. Gupta, Ali Behrangi, Muhammad Jawad, Sadaf Moghisi, Guo-Yue Niu Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25093v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

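Scoring rationale: The paper focuses on a physics-aware AI framework for hydrology (the Mass-Conserving Perceptron) and is unrelated to the vast majority of the large-model keywords (LLM, MoE, RLHF, RAG, and so on). The only relevant keywords are "Explainable AI" (5 points), because the paper emphasizes physical interpretability, and "AI for Science" (8 points), because it is a scientific AI application, although in hydrology rather than bioinformatics or cheminformatics.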

!!! tip deepseek-chat TL;DR

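By progressively embedding hydrological process constraints into the Mass-Conserving Perceptron framework, the study improves the predictive performance and physical interpretability of rainfall-runoff models; in an evaluation on 15 US catchments, the best configurations approach the performance of an LSTM benchmark.

Abstract translation (Chinese)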

机器学习模型在水文应用中可实现较高的预测精度，但往往缺乏物理可解释性。质量守恒感知机（Mass-Conserving Perceptron, MCP）提供了一个具有物理意识的人工智能（AI）框架，该框架在强制遵循守恒原理的同时，允许从数据中学习水文过程关系。本研究探讨了如何在一个单一的MCP蓄水单元中逐步嵌入具有物理意义的水文过程表征，以提升降雨径流模拟的预测能力和可解释性。我们从最简化的MCP公式出发，依次引入了有界土壤蓄水、状态依赖的导水率、可变孔隙度、下渗能力、地表积水、垂向排水以及非线性地下水位动态。我们以日径流预测为目标，在美国大陆五个水文气候区域的15个流域中，评估了由此形成的具有过程意识的MCP模型层级。结果表明，逐步增强MCP单元的内部物理结构通常能改善预测性能。这些过程表征的影响强烈依赖于水文气候条件：垂向排水显著提高了干旱和冰雪主导流域的模型技能，但在降雨主导区域却降低了性能，而地表积水的影响相对较小。表现最佳的MCP配置在保持明确物理可解释性的同时，其预测技能接近长短期记忆网络基准模型。这些结果证明，将水文过程约束嵌入人工智能架构，为发展可解释且具有过程意识的降雨径流模型提供了一条前景广阔的路径。

Abstract

Machine learning models can achieve high predictive accuracy in hydrological applications but often lack physical interpretability. The Mass-Conserving Perceptron (MCP) provides a physics-aware artificial intelligence (AI) framework that enforces conservation principles while allowing hydrological process relationships to be learned from data. In this study, we investigate how progressively embedding physically meaningful representations of hydrological processes within a single MCP storage unit improves predictive skill and interpretability in rainfall-runoff modeling. Starting from a minimal MCP formulation, we sequentially introduce bounded soil storage, state-dependent conductivity, variable porosity, infiltration capacity, surface ponding, vertical drainage, and nonlinear water-table dynamics. The resulting hierarchy of process-aware MCP models is evaluated across 15 catchments spanning five hydroclimatic regions of the continental United States using daily streamflow prediction as the target. Results show that progressively augmenting the internal physical structure of the MCP unit generally improves predictive performance. The influence of these process representations is strongly hydroclimate dependent: vertical drainage substantially improves model skill in arid and snow-dominated basins but reduces performance in rainfall-dominated regions, while surface ponding has comparatively small effects. The best-performing MCP configurations approach the predictive skill of a Long Short-Term Memory benchmark while maintaining explicit physical interpretability. These results demonstrate that embedding hydrological process constraints within AI architectures provides a promising pathway toward interpretable and process-aware rainfall-runoff modeling.

Keywords: rainfall-runoff modeling, physics-aware AI, Mass-Conserving Perceptron, hydrological process constraints, interpretable machine learning, daily streamflow prediction, process-aware modeling, hydroclimatic regions
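
The conservation principle described in the abstract can be illustrated with a generic single-storage bucket in which every unit of water entering the store leaves as outflow, leaves as loss, or remains in storage. This is a simplified sketch and not the MCP formulation from the paper: the fixed fractions `out_frac` and `loss_frac` stand in for relationships that the paper learns from data, and the precipitation series is made up.

```python
def step_storage(storage, inflow, out_frac, loss_frac):
    """Advance one time step; the fractions split the available water without creating or destroying any."""
    if out_frac < 0.0 or loss_frac < 0.0 or out_frac + loss_frac > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    available = storage + inflow
    outflow = out_frac * available
    loss = loss_frac * available
    remaining = available - outflow - loss
    return remaining, outflow, loss


def simulate(precip, out_frac=0.3, loss_frac=0.1, storage=0.0):
    """Run the bucket over a precipitation series and collect the fluxes."""
    outflows, losses = [], []
    for p in precip:
        storage, q, l = step_storage(storage, p, out_frac, loss_frac)
        outflows.append(q)
        losses.append(l)
    return storage, outflows, losses


# Hypothetical daily precipitation depths.
precip = [5.0, 0.0, 12.0, 3.0, 0.0]
final_storage, outflows, losses = simulate(precip)
balance_error = sum(precip) - (sum(outflows) + sum(losses) + final_storage)
print(balance_error)
```

Because the fractions are bounded and sum to at most one, the mass balance closes up to floating-point rounding whatever values they take, which is the property a mass-conserving architecture is meant to guarantee by construction.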

259. ❌ Ultra-fast Traffic Nowcasting and Control via Differentiable Agent-based Simulation

Authors: Fumiyasu Makinoshima, Yuya Yamaguchi, Eigo Segawa, Koichiro Niinuma, Sean Qian Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

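Scoring rationale: The paper focuses on traffic digital twins and agent-based traffic simulation; its core contribution is a set of differentiable computing techniques for traffic simulation, calibration, and control. The scoring keywords all relate to large language models, deep learning principles, or AI for Science (bioinformatics/cheminformatics), whereas this paper addresses simulation and optimization in traffic engineering. It involves no large models, deep learning techniques, or concrete AI for Science applications, so all keywords receive a relevance of 0.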

!!! tip deepseek-chat TL;DR

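The paper proposes a differentiable agent-based traffic simulator that addresses the non-differentiability and high calibration cost of conventional traffic simulation, enabling ultra-fast model calibration, traffic nowcasting, and control on large road networks.

Abstract translation (Chinese)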

交通数字孪生通过基于真实交通数据校准的大规模高保真计算模型，为决策者提供有效干预措施，有望应对快速城市化世界中的社会挑战。然而，传统的细粒度交通模拟具有不可微分的特性，通常依赖于低效的无梯度优化方法，导致实际应用中的模型校准在计算上难以实现。本文提出了一种基于可微分智能体的交通模拟器，能够在大规模路网上实现超快速模型校准、交通实时推演与控制。我们开发了多种可微分计算技术来模拟个体车辆运动，包括随机决策过程与智能体间交互，同时确保整个模拟轨迹保持端到端的可微分性，从而实现高效的基于梯度的优化。在包含超过10,000个校准参数的大规模芝加哥道路网络上，我们的模型以173倍实时速度模拟了超过一百万辆车。这种超快速模拟与高效的梯度优化相结合，使我们能够在455秒内利用过去30分钟的交通数据完成模型校准，在21秒内提供未来一小时的交通实时推演，并在728秒内求解相应的交通控制问题。这实现了完整的“校准-推演-控制”闭环流程，总耗时低于20分钟，为实施干预措施留出约40分钟的提前量。因此，我们的工作为实现交通数字孪生提供了切实可行的计算基础。

Abstract

Traffic digital twins, which inform policymakers of effective interventions based on large-scale, high-fidelity computational models calibrated to real-world traffic, hold promise for addressing societal challenges in our rapidly urbanizing world. However, conventional fine-grained traffic simulations are non-differentiable and typically rely on inefficient gradient-free optimization, making calibration for real-world applications computationally infeasible. Here we present a differentiable agent-based traffic simulator that enables ultra-fast model calibration, traffic nowcasting, and control on large-scale networks. We develop several differentiable computing techniques for simulating individual vehicle movements, including stochastic decision-making and inter-agent interactions, while ensuring that entire simulation trajectories remain end-to-end differentiable for efficient gradient-based optimization. On the large-scale Chicago road network, with over 10,000 calibration parameters, our model simulates more than one million vehicles at 173 times real-time speed. This ultra-fast simulation, together with efficient gradient-based optimization, enables us to complete model calibration using the previous 30 minutes of traffic data in 455 s, provide a one-hour-ahead traffic nowcast in 21 s, and solve the resulting traffic control problem in 728 s. This yields a full calibration–nowcast–control loop in under 20 minutes, leaving about 40 minutes of lead time for implementing interventions. Our work thus provides a practical computational basis for realizing traffic digital twins.

Keywords: differentiable simulation, agent-based simulation, traffic digital twins, model calibration, traffic nowcasting, traffic control, gradient-based optimization, large-scale networks

260. ❌ MP-MoE: Matrix Profile-Guided Mixture of Experts for Precipitation Forecasting

Authors: Huyen Ngoc Tran, Dung Trung Tran, Hong Nguyen, Xuan Vu Phan, Nam-Phong Nguyen Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

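Scoring rationale: The paper's core contribution is the MP-MoE framework, which combines a Matrix Profile objective with a Mixture of Experts (MoE) architecture for precipitation forecasting, so it is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10 points). The work is an AI application in meteorological science and is somewhat related to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). The paper mentions data-driven post-processing, which can be viewed as a form of supervised fine-tuning, giving a weak link to "Post-training OR Supervised Fine-tuning OR SFT" (5 points). The remaining keywords, which concern large-model principles, alignment, reasoning, agents, and similar topics, are not covered in the paper and receive 0.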

!!! tip deepseek-chat TL;DR

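To address the bias of numerical weather prediction models in tropical precipitation forecasting, the paper proposes the Matrix Profile-guided Mixture of Experts (MP-MoE) framework, which combines an intensity loss with a structure-aware Matrix Profile objective; on rainfall datasets from two major river basins in Vietnam, it shows better accuracy on heavy rainfall events and better preservation of storm morphology.

Abstract translation (Chinese)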

降水预报在越南等热带地区始终是一项严峻挑战，复杂地形和对流不稳定性常制约数值天气预报（Numerical Weather Prediction, NWP）模型的精度。尽管数据驱动的后处理技术被广泛用于缓解此类偏差，但现有框架大多依赖逐点目标函数，其在轻微时间错位下易受“双重惩罚”效应影响。本研究提出矩阵轮廓引导的专家混合（Matrix Profile-guided Mixture of Experts, MP-MoE）框架，该框架将传统强度损失与结构感知的矩阵轮廓目标函数相结合。通过利用子序列层面的相似性而非逐点误差，所提出的损失函数实现了更可靠的专家选择，并缓解了相位偏移导致的过度惩罚。我们在越南两大流域的降雨数据集上对MP-MoE进行了多时间尺度的评估，包括1小时降雨强度及12、24、48小时累积降雨量。实验结果表明，在强降雨事件的平均临界成功指数（CSI-M）方面，MP-MoE优于原始NWP输出及基线学习方法，同时显著降低了动态时间规整（Dynamic Time Warping, DTW）值。这些发现凸显了该框架在捕捉峰值降雨强度及保持风暴事件形态完整性方面的有效性。

摘要 (Abstract)

Precipitation forecasting remains a persistent challenge in tropical regions like Vietnam, where complex topography and convective instability often limit the accuracy of Numerical Weather Prediction (NWP) models. While data-driven post-processing is widely used to mitigate these biases, most existing frameworks rely on point-wise objective functions, which suffer from the "double penalty" effect under minor temporal misalignments. In this work, we propose the Matrix Profile-guided Mixture of Experts (MP-MoE), a framework that integrates conventional intensity loss with a structural-aware Matrix Profile objective. By leveraging subsequence-level similarity rather than point-wise errors, the proposed loss facilitates more reliable expert selection and mitigates excessive penalization caused by phase shifts. We evaluate MP-MoE on rainfall datasets from two major river basins in Vietnam across multiple horizons, including 1-hour intensity and accumulated rainfall over 12, 24, and 48 hours. Experimental results demonstrate that MP-MoE outperforms raw NWP and baseline learning methods in terms of Mean Critical Success Index (CSI-M) for heavy rainfall events, while significantly reducing Dynamic Time Warping (DTW) values. These findings highlight the framework’s efficacy in capturing peak rainfall intensities and preserving the morphological integrity of storm events.

Keywords: Precipitation forecasting, Mixture of Experts, Matrix Profile, Numerical Weather Prediction, Data-driven post-processing, Dynamic Time Warping, Rainfall intensity, Tropical regions
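
The "double penalty" effect mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's loss or data: the rainfall series are made-up numbers, and `mse` and `dtw` are illustrative helper names. It compares a forecast whose peak arrives one step late against a forecast that predicts no rain at all, first under a point-wise error and then under Dynamic Time Warping (DTW).

```python
def mse(obs, pred):
    """Point-wise mean squared error between two equal-length series."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def dtw(a, b):
    """Classic dynamic-programming DTW distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Made-up hourly rainfall values, for illustration only.
observed = [0.0, 0.0, 10.0, 0.0, 0.0]
late_peak = [0.0, 0.0, 0.0, 10.0, 0.0]  # right intensity, one step late
no_rain = [0.0] * 5                     # misses the event entirely

print(mse(observed, late_peak), mse(observed, no_rain))
print(dtw(observed, late_peak), dtw(observed, no_rain))
```

Under the point-wise error, the late-peak forecast (40.0) scores worse than predicting no rain (20.0), because it is penalized once for the missed peak and again for the misplaced one. DTW ranks the two forecasts the other way round (0.0 versus 10.0), which is why a shape-aware measure is a natural complement to point-wise scores.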

261. ❌ Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI

Authors: Steffen Lukas Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

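Scoring rationale: The paper centers on the reliability of foundation models in high-stakes domains such as medicine and finance. It proposes the Epistemic Compression principle, which holds that model complexity should match the shelf life of the data rather than scale parameters blindly. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The paper touches on the trade-off between data quality and model scaling, giving it some relevance to 'Scaling Laws AND Data Quality' (5 points). It focuses on high-stakes scientific application domains, giving it some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper examines the Fidelity Paradox, in which foundation models fail in high-stakes domains because the data are unstable. It proposes the Epistemic Compression principle and a Regime Index; in an exploratory synthesis of 15 high-stakes domains, the index agreed with the empirically superior modeling strategy in 86.7% of cases (13/15).

Abstract translation (Chinese)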

基础模型在稳定环境中表现出色，但在可靠性至关重要的领域——如医学、金融和政策制定中——却常常失效。这种“保真度悖论”不仅是数据问题，更是结构性问题。在规则随时间变化的领域中，额外的模型容量会放大噪声而非捕捉有效信号。我们提出“认知压缩”原则：鲁棒性源于使模型复杂度与数据有效期限相匹配，而非通过扩展参数规模获得。与经典的正则化方法（事后惩罚权重）不同，认知压缩通过架构设计强制实现简约性：模型结构本身被设计为通过提高架构成本来表征超出数据证据的方差，从而减少过拟合。我们通过“机制指数”实现这一原则，该指数将“漂移机制”（不稳定、数据贫乏；简约性占优）与“稳定机制”（恒定、数据丰富；复杂性可行）区分开来。在对15个高风险领域进行的探索性综合研究中，该指数与实证中更优的建模策略在86.7%的案例（13/15）中保持一致。高风险人工智能领域需要从盲目追求规模扩展转向有原则的简约性设计。

摘要 (Abstract)

Foundation models excel in stable environments, yet often fail where reliability matters most: medicine, finance, and policy. This Fidelity Paradox is not just a data problem; it is structural. In domains where rules change over time, extra model capacity amplifies noise rather than capturing signal. We introduce Epistemic Compression: the principle that robustness emerges from matching model complexity to the shelf life of the data, not from scaling parameters. Unlike classical regularization, which penalizes weights post hoc, Epistemic Compression enforces parsimony through architecture: the model structure itself is designed to reduce overfitting by making it architecturally costly to represent variance that exceeds the evidence in the data. We operationalize this with a Regime Index that separates Shifting Regime (unstable, data-poor; simplicity wins) from Stable Regime (invariant, data-rich; complexity viable). In an exploratory synthesis of 15 high-stakes domains, this index was concordant with the empirically superior modeling strategy in 86.7% of cases (13/15). High-stakes AI demands a shift from scaling for its own sake to principled parsimony.

Keywords: Foundation Models, Epistemic Compression, High-Stakes AI, Fidelity Paradox, Regime Index, Model Complexity, Data Shelf Life, Robustness

262. ❌ Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

Authors: Haishan Ye Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25029v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

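Scoring rationale: The paper studies online convex optimization (OCO) with two-point bandit feedback, which belongs to classical optimization theory and is unrelated to any of the scoring keywords (all of which concern large models, deep learning techniques, and their applications). It does not involve large-model techniques, training methods, inference optimization, alignment, applications, or related concepts.

!!! tip deepseek-chat TL;DR

The paper resolves the open problem of high-probability regret bounds for strongly convex losses in online convex optimization with two-point bandit feedback, giving the first bound that is minimax optimal in both the time horizon and the dimension.

Abstract translation (Chinese)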

本文研究了对抗环境下具有两点反馈的在线凸优化问题。在该设定中，玩家试图最小化一系列对抗性生成的凸损失函数，但仅能观测到每个函数在两点的取值。尽管已知两点反馈可用于梯度估计，但如\citet{agarwal2010optimal}所指出，对于强凸函数如何获得紧的高概率遗憾界仍是一个未解决的问题。主要挑战在于反馈梯度估计量的重尾特性，这使得标准的集中性分析难以进行。本文通过首次给出μ-强凸损失函数下$O(d(\log T + \log(1/δ))/μ)$的高概率遗憾界，解决了这一公开难题。我们的结果在时间范围$T$和维度$d$上均达到极小极大最优。

摘要 (Abstract)

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well-known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.

Keywords: Online Convex Optimization, two-point bandit feedback, adversarial environment, strongly convex functions, high-probability regret bounds, minimax optimal, gradient estimation, heavy-tailed estimators
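
The abstract notes that two function values per round are enough to estimate a gradient. A minimal sketch of the textbook two-point estimator from the bandit optimization literature follows; it is an assumption that the paper analyzes an estimator of this general form, and the function names, the toy loss, and the value of `delta` are all illustrative.

```python
import math
import random


def unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def two_point_gradient(f, x, delta, rng):
    """Gradient estimate built from f(x + delta*u) and f(x - delta*u)."""
    dim = len(x)
    u = unit_vector(dim, rng)
    f_plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = dim * (f_plus - f_minus) / (2.0 * delta)
    return [scale * ui for ui in u]


def squared_norm(x):
    """Toy strongly convex loss whose gradient is 2x."""
    return sum(xi * xi for xi in x)


rng = random.Random(0)
point = [1.0, 2.0, 3.0]
samples = 20000
avg = [0.0] * len(point)
for _ in range(samples):
    g = two_point_gradient(squared_norm, point, 0.01, rng)
    avg = [a + gi / samples for a, gi in zip(avg, g)]
print(avg)  # averages toward the true gradient [2.0, 4.0, 6.0]
```

A single estimate is noisy, since it points along one random direction, but the average over many draws approaches the true gradient. How that per-round noise behaves in the tails is exactly what makes high-probability analysis harder than analysis in expectation.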

263. ❌ Improving Infinitely Deep Bayesian Neural Networks with Nesterov’s Accelerated Gradient Method

Authors: Chenxu Yu, Wenqi Fang Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

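Scoring rationale: The paper studies optimization of Bayesian neural networks (BNNs) based on stochastic differential equations (SDEs), introducing Nesterov's accelerated gradient (NAG) to reduce the number of function evaluations (NFEs) and improve convergence stability. It focuses on numerical optimization and computational efficiency of deep networks, in particular infinitely deep ones, which is low-level deep learning methodology. However, it does not involve large language models (LLMs), large-model techniques, AI for Science applications, or any of the specific techniques in the keyword list (MoE, RLHF, RAG, quantization, and so on). Although it falls within deep learning, it is unrelated to the large-model topics this review targets.

!!! tip deepseek-chat TL;DR

To address the high computational cost and unstable convergence of SDE-based Bayesian neural networks, the paper proposes an improved method that integrates Nesterov's accelerated gradient, substantially reducing function evaluations during training and testing while improving predictive accuracy.

Abstract translation (Chinese)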

作为连续深度神经网络方法的代表，基于随机微分方程的贝叶斯神经网络因其坚实的理论基础和强大的实际应用潜力而备受关注。然而，其对数值随机微分方程求解器的依赖不可避免地导致大量的函数评估次数，从而产生高昂的计算成本，并偶尔引发收敛不稳定性。为应对这些挑战，我们提出了一种内斯特罗夫加速梯度增强的随机微分方程-贝叶斯神经网络模型。通过将内斯特罗夫加速梯度整合到随机微分方程-贝叶斯神经网络框架中，并结合一个与函数评估次数相关的残差跳跃连接，我们的方法在训练和测试阶段均加速了收敛过程，并显著减少了函数评估次数。大量的实证结果表明，我们的模型在图像分类和序列建模等多种任务中，始终优于传统的随机微分方程-贝叶斯神经网络，实现了更低的函数评估次数和更高的预测准确性。

摘要 (Abstract)

As a representative continuous-depth neural network approach, stochastic differential equation (SDE)-based Bayesian neural networks (BNNs) have attracted considerable attention due to their solid theoretical foundations and strong potential for real-world applications. However, their reliance on numerical SDE solvers inevitably incurs a large number of function evaluations (NFEs), resulting in high computational cost and occasional convergence instability. To address these challenges, we propose a Nesterov-accelerated gradient (NAG) enhanced SDE-BNN model. By integrating NAG into the SDE-BNN framework along with an NFE-dependent residual skip connection, our method accelerates convergence and substantially reduces NFEs during both training and testing. Extensive empirical results show that our model consistently outperforms conventional SDE-BNNs across various tasks, including image classification and sequence modeling, achieving lower NFEs and improved predictive accuracy.

Keywords: Bayesian neural networks, stochastic differential equations, Nesterov-accelerated gradient, function evaluations, convergence acceleration, infinite-depth networks, numerical SDE solvers, computational efficiency
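
For readers unfamiliar with Nesterov's accelerated gradient, the sketch below shows the classical optimizer form of the update on a toy scalar objective. It is only background: the abstract describes integrating NAG into the SDE-BNN framework itself, which this example does not attempt, and the step size, momentum, and objective are illustrative assumptions.

```python
def nesterov_minimize(grad, x0, lr=0.1, momentum=0.9, steps=200):
    """Classical Nesterov accelerated gradient on a scalar objective."""
    x, v = x0, 0.0
    for _ in range(steps):
        lookahead = x + momentum * v  # evaluate the gradient ahead of x
        v = momentum * v - lr * grad(lookahead)
        x = x + v
    return x


# Toy objective f(x) = x^2 with gradient 2x; the minimizer is x = 0.
x_final = nesterov_minimize(lambda x: 2.0 * x, x0=5.0)
print(x_final)
```

The distinguishing step is the look-ahead: the gradient is taken at `x + momentum * v` rather than at `x`, which damps the overshoot that plain momentum tends to produce.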

264. ❌ A Public Theory of Distillation Resistance via Constraint-Coupled Reasoning Architectures

Authors: Peng Wei, Wesley Shu Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25022v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

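Scoring rationale: The paper studies the risks of knowledge distillation, model extraction, and behavior transfer, and proposes a theoretical framework of constraint-coupled reasoning architectures to strengthen distillation resistance. Keyword relevance: (1) 'Large Language Models' scores 5: the paper discusses distillation risk in frontier AI, where LLMs are central, but does not go into LLM technical details; (2) 'Instruction Tuning OR Alignment OR Value Alignment' scores 8: the paper explicitly mentions "alignment" and discusses model governance and distillation resistance, which relate to alignment without being core alignment techniques; (3) all other keywords score 0: the paper does not cover MoE, SLMs, training methods, reasoning techniques, agent systems, compression or acceleration, AI for Science, or other specific techniques.

!!! tip deepseek-chat TL;DR

To address the governance risks posed by knowledge distillation and model extraction, the paper proposes a theoretical framework of constraint-coupled reasoning architectures that lowers the value of distillation by coupling capability to internal stability constraints, and it offers testable hypotheses for distillation resistance and model governance.

Abstract translation (Chinese)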

知识蒸馏、模型提取与行为迁移已成为前沿人工智能领域的核心关切。主要风险不仅在于复制行为本身，更在于有用能力可能以远低于其原有治理架构成本的代价被转移。本文提出一种公开的、可保护商业机密的理论框架，旨在架构层面降低这种不对称性。核心论点是：当高层次能力与塑造状态随时间演化的内部稳定性约束相耦合时，蒸馏作为捷径的价值将显著降低。为形式化这一思想，本文引入包含四个要素的约束耦合推理框架：有界转移负担、路径负载累积、动态演化可行域以及能力-稳定性耦合条件。本文设计遵循公开安全性原则：省略了专有实现细节、训练方案、阈值设定、隐状态监测机制、部署流程及机密系统设计选择。因此，本研究的贡献在于理论层面而非操作层面。它提出了一个可证伪的架构命题、清晰的威胁模型，以及一组可通过实验验证的假设，为未来关于蒸馏抵抗、对齐与模型治理的研究提供理论基础。

摘要 (Abstract)

Knowledge distillation, model extraction, and behavior transfer have become central concerns in frontier AI. The main risk is not merely copying, but the possibility that useful capability can be transferred more cheaply than the governance structure that originally accompanied it. This paper presents a public, trade-secret-safe theoretical framework for reducing that asymmetry at the architectural level. The core claim is that distillation becomes less valuable as a shortcut when high-level capability is coupled to internal stability constraints that shape state transitions over time. To formalize this idea, the paper introduces a constraint-coupled reasoning framework with four elements: bounded transition burden, path-load accumulation, dynamically evolving feasible regions, and a capability-stability coupling condition. The paper is intentionally public-safe: it omits proprietary implementation details, training recipes, thresholds, hidden-state instrumentation, deployment procedures, and confidential system design choices. The contribution is therefore theoretical rather than operational. It offers a falsifiable architectural thesis, a clear threat model, and a set of experimentally testable hypotheses for future work on distillation resistance, alignment, and model governance.

Keywords: knowledge distillation, model extraction, distillation resistance, constraint-coupled reasoning, model governance, capability-stability coupling, architectural framework, frontier AI

265. ❌ A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Authors: Shalima Binta Manir, Anamika Paul Rupa Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25009v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

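Scoring rationale: The paper studies grokking in neural networks (the delayed transition from memorization to generalization), focusing on the interaction of architecture, optimization, and regularization. It is unrelated to the vast majority of keywords, which mainly concern large language models (LLMs) and associated techniques such as fine-tuning, alignment, inference acceleration, and agents, whereas this paper studies the generalization behavior of basic networks (MLPs and Transformers) on a modular addition task. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI': the paper's systematic experiments shed light on the mechanisms behind generalization, which falls within explainable AI, but this is not its core focus, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Through systematic experiments, the paper studies the mechanisms of grokking (the delayed transition from memorization to generalization) and finds that its dynamics are determined mainly by the interaction between optimization stability and regularization, especially weight decay, rather than by architecture itself.

Abstract translation (Chinese)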

理解神经网络中从记忆到泛化的延迟转变现象（grokking）仍不充分，部分原因是先前的实证研究混淆了架构、优化和正则化的作用。我们通过一项受控研究，在模加法（mod 97）任务上系统性地分离了这些因素，并在不同模型间采用匹配且精细调优的训练方案。我们的核心发现是：grokking动态主要由优化稳定性与正则化之间的相互作用决定，而非架构本身。具体而言，我们表明：（1）深度具有非单调效应：深度为4的多层感知机（MLP）始终无法实现grokking，而深度为8的残差网络能恢复泛化能力，这表明深度需要架构稳定性支持；（2）在超参数匹配的条件下，Transformer与MLP之间的明显差距基本消失（延迟仅1.11倍），说明先前报道的差异主要源于优化器和正则化的混淆；（3）激活函数的影响依赖于训练机制：仅当正则化允许记忆时，GELU可比ReLU快达4.3倍；（4）权重衰减是主导控制参数，存在一个狭窄的“恰到好处”区间使grokking发生，而过少或过多的权重衰减均会阻碍泛化。基于每种配置3–5次随机种子的实验，这些结果为grokking作为一种相互作用驱动的现象提供了统一的实证解释。我们的发现挑战了以架构为中心的解释，并阐明了优化与正则化如何共同调控延迟泛化。

摘要 (Abstract)

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) the apparent gap between Transformers and MLPs largely disappears (1.11× delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) activation function effects are regime-dependent, with GELU up to 4.3× faster than ReLU only when regularization permits memorization; and (4) weight decay is the dominant control parameter, exhibiting a narrow "Goldilocks" regime in which grokking occurs, while too little or too much prevents generalization. Across 3–5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.

Keywords: grokking, neural networks, generalization, regularization, optimization, depth, activation function, weight decay
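
The modular addition task named in the abstract is small enough to enumerate in full. The sketch below builds every (a, b, (a + b) mod 97) triple and splits the set into training and held-out parts; the split fraction and seed are illustrative assumptions, since the abstract does not state the paper's settings, and the function names are hypothetical.

```python
import random

MODULUS = 97  # from the abstract: modular addition (mod 97)


def modular_addition_dataset(modulus=MODULUS):
    """All (a, b, (a + b) mod modulus) triples."""
    return [(a, b, (a + b) % modulus) for a in range(modulus) for b in range(modulus)]


def split(examples, train_fraction, seed):
    """Shuffle deterministically, then split into train and held-out sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = modular_addition_dataset()
# train_fraction and seed are assumptions, not the paper's settings.
train, held_out = split(data, train_fraction=0.5, seed=0)
print(len(data), len(train), len(held_out))
```

Because the full input space has only 97 × 97 = 9409 examples, a network can memorize the training half outright, which is what makes the later jump in held-out accuracy easy to observe on this task.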

266. ❌ The Value of Information in Resource-Constrained Pricing

Authors: Ruicheng Ao, Jiashuo Jiang, David Simchi-Levi Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

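Scoring rationale: The paper studies the value of information in resource-constrained pricing, which belongs to operations research, dynamic pricing, and demand forecasting. It does not involve any technique, method, or application of large models, deep learning, or AI for Science, and is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The paper studies how demand-prediction uncertainty affects dynamic pricing under capacity constraints. It proves that a certified forecast with a known error bound reduces regret from O(√T) to O(log T) when the bound is small enough (at most on the order of T^(-1/4)), and shows that a surrogate model reduces learning variance through control variates.

Abstract translation (Chinese)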

企业对易逝资源（如航空座位、酒店客房、季节性库存）进行定价时，现已普遍采用需求预测，但这些预测的质量差异显著。在严格的产能约束下，依据不准确的预测采取行动可能导致未来时期所需库存的不可逆损耗。本研究探讨了在线性需求、随机噪声及有限产能条件下，预测不确定性如何影响动态定价决策。一个具有已知误差界~$ε^0$的认证需求预测指明了系统应运行的区域：当$ε^0 \lesssim T^{-1/4}$时，它将遗憾度从$O(\sqrt{T})$降低至$O(\log T)$，我们证明该阈值是紧的。一个误设的替代模型——虽存在偏差但与真实需求相关——虽不能直接设定价格，但可通过控制变量法将学习方差降低$(1-ρ^2)$倍。两种机制可协同作用：预测决定遗憾机制；替代模型则在该机制内提升估计精度。所有算法均基于一种边界吸引机制，该机制能在无需非退化假设的情况下，使定价在退化的产能边界附近保持稳定。实验验证了相变阈值、替代模型带来的方差缩减效果，以及算法在不同问题实例中的鲁棒性。

摘要 (Abstract)

Firms that price perishable resources – airline seats, hotel rooms, seasonal inventory – now routinely use demand predictions, but these predictions vary widely in quality. Under hard capacity constraints, acting on an inaccurate prediction can irreversibly deplete inventory needed for future periods. We study how prediction uncertainty propagates into dynamic pricing decisions with linear demand, stochastic noise, and finite capacity. A certified demand forecast with known error bound $\epsilon^0$ specifies where the system should operate: it shifts regret from $O(\sqrt{T})$ to $O(\log T)$ when $\epsilon^0 \lesssim T^{-1/4}$, and we prove this threshold is tight. A misspecified surrogate model – biased but correlated with true demand – cannot set prices directly but reduces learning variance by a factor of $(1-\rho^2)$ through control variates. The two mechanisms compose: the forecast determines the regret regime; the surrogate tightens estimation within it. All algorithms rest on a boundary attraction mechanism that stabilizes pricing near degenerate capacity boundaries without requiring non-degeneracy assumptions. Experiments confirm the phase transition threshold, the variance reduction from surrogates, and robustness across problem instances.

Keywords: dynamic pricing, demand prediction, capacity constraints, regret analysis, control variates, perishable resources, stochastic noise, boundary attraction
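
The $(1-\rho^2)$ factor in the abstract is the standard variance reduction of a control variate with the optimal coefficient. The sketch below checks that identity on synthetic data. It is not the paper's pricing algorithm: the data-generating numbers, the variable names `surrogate` and `demand`, and the helper functions are all illustrative assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
# Hypothetical data: "surrogate" is a biased but correlated proxy for "demand".
surrogate = [rng.gauss(0.0, 1.0) for _ in range(5000)]
demand = [2.0 + 0.8 * s + rng.gauss(0.0, 0.6) for s in surrogate]

# Optimal control-variate coefficient c = Cov(surrogate, demand) / Var(surrogate).
c = covariance(surrogate, demand) / variance(surrogate)
# The surrogate mean is taken from the sample here; in practice it would be
# known in advance or estimated from separate, cheaper data.
s_mean = mean(surrogate)
adjusted = [d - c * (s - s_mean) for d, s in zip(demand, surrogate)]

rho_sq = covariance(surrogate, demand) ** 2 / (variance(surrogate) * variance(demand))
ratio = variance(adjusted) / variance(demand)
print(ratio, 1.0 - rho_sq)
```

The two printed numbers agree up to floating-point error: subtracting the correlated part of the surrogate leaves only the fraction $1-\rho^2$ of the original variance, so a surrogate is useful in proportion to its correlation with true demand, regardless of its bias.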

267. ❌ Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

Authors: Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Ani Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24963v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

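Scoring rationale: The paper focuses on the Standard Model Template (SMT), a framework for standardizing machine learning models in recommendation systems, aimed at development and deployment efficiency in large-scale model ecosystems. All keywords target specific directions such as large-model internals, training methods, inference optimization, alignment, and applications, whereas this paper discusses a general ML model-building framework and engineering efficiency and does not involve any specific large-model technique, training method, or scientific application. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies development and deployment efficiency for large-scale machine learning model ecosystems in recommendation systems and proposes the Standard Model Template (SMT) framework. Experiments in a production ads ranking ecosystem show a 0.63% average cross-entropy improvement at neutral serving capacity, a 92% reduction in per-model iteration engineering time, and a 6.3× increase in technique-model pair adoption throughput.

Abstract translation (Chinese)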

现代计算广告平台通常依赖推荐系统来预测用户响应，例如点击率、转化率及其他优化事件。为支持多样化的产品界面和广告主目标，这些平台往往需要维护一个庞大的机器学习模型生态系统。然而，在这种规模下运行会带来显著的开发与效率挑战。定期更新机器学习模型并推广新技术需要大量的工程投入，导致机器学习创新在整个生态系统中的部署存在较长延迟。
本文通过大规模实证研究，比较了推荐系统中标准化建模方法与独立单模型优化在模型性能、效率及机器学习技术传播方面的差异。为实现这种标准化，我们提出了标准模型模板——一种能够生成适应不同数据分布和优化事件的高性能模型框架。通过采用标准化、可组合的机器学习模型组件，SMT将技术传播复杂度从$O(n \cdot 2^k)$降低至$O(n + k)$，其中$n$为模型数量，$k$为技术数量。
通过在Meta广告排序生产生态系统中对大量模型进行四个全球开发周期的评估，我们的研究结果表明：（1）在保持服务能力不变的情况下，交叉熵平均提升0.63%；（2）单模型迭代工程时间减少92%；（3）技术-模型配对采用吞吐量提升$6.3$倍。这些发现对“多样化优化目标必然需要差异化机器学习模型设计”的传统观点提出了挑战。

摘要 (Abstract)

Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) – a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from $O(n \cdot 2^k)$ to $O(n + k)$ where $n$ is the number of models and $k$ the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta’s production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a $6.3\times$ increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design.

Keywords: recommendation systems, machine learning models, model standardization, Standard Model Template, technique propagation, large-scale ecosystems, engineering efficiency, model performance
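
To get a feel for the gap between the two propagation complexities quoted in the abstract, the sketch below simply evaluates $n \cdot 2^k$ and $n + k$ for a couple of hypothetical ecosystem sizes. The values of `n` and `k` are made up, asymptotic notation hides constant factors, and the function names are illustrative, so the numbers indicate growth rates rather than real engineering cost.

```python
def per_model_cost(n_models, k_techniques):
    """Evaluate n * 2^k, the abstract's expression for per-model optimization."""
    return n_models * 2 ** k_techniques


def template_cost(n_models, k_techniques):
    """Evaluate n + k, the abstract's expression for standardized components."""
    return n_models + k_techniques


# n and k below are hypothetical values chosen only for illustration.
for n, k in [(10, 3), (100, 10)]:
    print(n, k, per_model_cost(n, k), template_cost(n, k))
```

With 100 models and 10 techniques the first expression evaluates to 102400 and the second to 110, which shows why the exponential term in the number of techniques dominates as an ecosystem grows.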

268. ❌ MobileDev-Bench: A Comprehensive Benchmark for Evaluating Language Models on Mobile Application Development

Authors: Moshood A. Fakorede, Krishna Upadhyay, A. B. Siddique, Umar Farooq Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

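Scoring rationale: The paper's core contribution is evaluating large language models (LLMs) on mobile application development tasks, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It focuses on building a benchmark and evaluating existing models, and does not cover model architecture innovations (e.g., MoE, SLMs), training methods (e.g., pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (e.g., RAG, context extension, attention optimization, quantization, decoding acceleration), reasoning capabilities (e.g., chain of thought, System 2 thinking, MCTS, self-correction), agent systems (e.g., LLM agents, tool use, multi-agent), model interpretability, world models, model merging, in-context learning, or applications in specific scientific fields (e.g., bioinformatics), so all of those keywords score 0.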
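The arithmetic behind the score table can be sketched as follows. The passing threshold uses the formula given in the report's configuration section (5.0 + 0.8 × total weight). The per-row rule is an assumption: the report does not state how a row's score is derived, and `row_score` simply uses weight × relevance, which is consistent with the rows shown. Both function names are hypothetical.

```python
def passing_score(total_weight: float) -> float:
    """Passing threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: a row's score is weight * relevance (not stated in the report)."""
    return weight * relevance


if __name__ == "__main__":
    # 27 keywords with weight 1.0 each, as listed in the configuration.
    print(round(passing_score(27.0), 1))
    # The LLM row for this paper: weight 0.0, relevance 10.0.
    print(row_score(0.0, 10.0))
```

Under this assumed rule, a row with relevance 10.0 still contributes 0.0 whenever its listed weight is 0.0, which matches the table.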

!!! tip deepseek-chat TL;DR

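To address the limited coverage of mobile application development in existing benchmarks, the paper proposes MobileDev-Bench for evaluating large language models on real-world mobile application issues. Current state-of-the-art code-capable models achieve very low end-to-end resolution rates on it (3.39%-5.21%), revealing a significant performance gap.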

摘要翻译

大语言模型（LLM）在自动化软件工程任务上已展现出强大性能，然而现有基准测试主要聚焦于通用库或Web应用程序，移动应用开发领域在很大程度上尚未被充分探索，尽管其具有严格的平台约束、框架驱动的生命周期以及复杂的平台API交互特性。我们推出MobileDev-Bench基准测试，该基准包含从18个生产级移动应用中收集的384个真实问题解决任务，涵盖Android Native（Java/Kotlin）、React Native（TypeScript）和Flutter（Dart）平台。每个任务将一个真实的开发者报告问题与可执行的测试补丁配对，从而能够在移动构建环境中对模型生成的修复方案进行全自动化验证。该基准测试展现出显著的补丁复杂性：修复平均涉及修改12.5个文件和324.9行代码，且35.7%的实例需要跨多种工件类型（如源代码和清单文件）进行协同修改。对四种先进的代码生成大语言模型（GPT-5.2、Claude Sonnet 4.5、Gemini Flash 2.5和Qwen3-Coder）的评估显示，其端到端问题解决率仅为3.39%-5.21%，与先前基准测试相比存在显著性能差距。进一步分析揭示了系统性的失败模式，其中跨多文件与多工件变更的故障定位成为主要瓶颈。

Abstract

Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications, leaving mobile application development largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark comprising 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluation of four state-of-the-art code-capable LLMs, GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder, yields low end-to-end resolution rates of 3.39%-5.21%, revealing significant performance gaps compared to prior benchmarks. Further analysis reveals systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.

Keywords: Large Language Models, Mobile Application Development, Benchmark, Code Generation, Software Engineering, Automated Validation, Performance Evaluation, Multi-platform

269. ❌ CVA: Context-aware Video-text Alignment for Video Temporal Grounding

Authors: Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24934v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

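Scoring rationale: The paper focuses on the Video Temporal Grounding task and proposes a framework called CVA, comprising a data augmentation strategy (QCD), a contrastive loss (CBD), and a Transformer encoder architecture (CTE). Although it involves video-text alignment and Transformer architectures, every keyword targets large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, quantization) or specific AI-for-science applications (e.g., bioinformatics). The paper's subject matter (video understanding, temporal boundary detection, multimodal alignment) has no direct connection to the keyword list in either technique or application, so every keyword receives a relevance score of 0.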

!!! tip deepseek-chat TL;DR

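The paper proposes CVA, a context-aware video-text alignment framework. Using Query-aware Context Diversification, a Context-invariant Boundary Discrimination loss, and a Context-enhanced Transformer Encoder, it achieves state-of-the-art performance on video temporal grounding and significantly improves Recall@1 scores.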

摘要翻译

我们提出上下文感知视频文本对齐框架，这是一种解决视频时序定位中关键挑战的新方法：实现对无关背景上下文保持鲁棒性的时序敏感视频文本对齐。该框架基于三个核心组件构建。首先，我们提出查询感知上下文多样化策略，这是一种新的数据增强方法，确保仅混合语义无关的内容。该方法通过构建基于视频文本相似度的替换片段池来模拟多样化上下文，同时避免因查询无关混合导致的“假阴性”问题。其次，我们引入上下文不变边界判别损失，这是一种对比损失函数，通过在具有挑战性的时序边界上强化语义一致性，使其表征对上下文变化和困难负样本具有鲁棒性。第三，我们设计上下文增强型Transformer编码器，这是一种分层架构，将窗口化自注意力机制、双向交叉注意力机制与可学习查询向量相结合，以捕捉多尺度时序上下文。通过这些以数据为中心和架构增强的协同作用，CVA在主流视频时序定位基准测试中实现了最先进的性能，包括QVHighlights和Charades-STA。值得注意的是，本方法在Recall@1指标上较现有最优方法提升约5个百分点，显著证明了其在缓解假阴性问题上的有效性。

Abstract

We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.

Keywords: Video Temporal Grounding, Video-text Alignment, Context-aware, Query-aware Context Diversification, Context-invariant Boundary Discrimination, Context-enhanced Transformer Encoder, Multi-scale Temporal Context, State-of-the-art Performance

270. ❌ Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML

Authors: Yassien Shaalan Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

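Scoring rationale: The paper focuses on model compression for TinyML and is unrelated to most of the LLM-oriented keywords. The main points of relevance are: 1) 'Small Language Models OR SLMs OR On-device AI' (8 points): the paper studies deploying neural networks on microcontrollers, which falls under on-device AI, but it does not explicitly involve language models; 2) 'Quantization OR Model Compression OR Low-bit Weights' (10 points): the core contribution is the HYPER-TINYPW compression method, which involves INT8 quantization and generative weight compression; 3) 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points): the paper is applied to ECG analysis and biosignal processing, which falls under bioinformatics. Other keywords such as MoE, pre-training, alignment, and RAG are not covered.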

!!! tip deepseek-chat TL;DR

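The paper proposes HYPER-TINYPW, a generative compression method that addresses the excessive memory footprint of pointwise mixer weights when deploying neural networks on memory-constrained microcontrollers. On three ECG benchmarks it achieves 6.31x model compression while retaining at least 95% of the large model's macro-F1.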

摘要翻译

在微控制器上部署神经网络受限于仅数KB的闪存和SRAM，即使在视觉、音频和可穿戴传感任务中应用INT8量化后，1x1逐点混合器仍常占据主要内存。我们提出HYPER-TINYPW，一种“压缩即生成”方法，用生成的权重替代大部分存储的逐点混合器权重：一个共享的微型多层感知机在加载时通过微小的逐层编码一次性合成逐点卷积核，将其缓存后使用标准整数算子执行。该方法保持了商用微控制器的运行时效率，仅增加一次性合成开销；稳态延迟和能耗与INT8可分离卷积神经网络基线持平。通过强制跨层共享潜在基向量消除了层间冗余，同时将第一层逐点卷积保持为INT8以稳定对形态敏感的早期混合操作。我们的贡献包括：（i）涵盖生成器、头部分解模块、编码、保留的第一层逐点卷积及主干网络的TinyML精确字节占用统计方法；（ii）采用验证集调优阈值t*与自助法置信区间的统一评估框架；（iii）覆盖纯整数推理及启动时合成与惰性合成的可部署性分析。在三个心电图基准数据集（Apnea-ECG、PTB-XL、MIT-BIH）上，HYPER-TINYPW重塑了宏F1分数与闪存占用的帕累托边界：在约225 kB内存下，其性能相当于约1.4 MB的卷积神经网络，体积缩小6.31倍（减少84.15%字节），同时保持至少95%的大模型宏F1分数。在32-64 kB内存预算下，该方法能维持均衡的检测性能，而紧凑基线模型则出现性能衰退。该机制可广泛适用于其他一维生物信号、端侧语音及嵌入式传感任务——这些场景中层间冗余占主导地位，表明“压缩即生成”在资源受限机器学习系统中具有更广泛的应用潜力。除心电图外，HYPER-TINYPW可迁移至TinyML音频任务：在Speech Commands数据集上达到96.2%测试准确率（最佳验证准确率98.2%），证明该方法对以重复线性混合器为主要内存占用的嵌入式传感任务具有更广泛的适用性。

Abstract

Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) mixers often dominate memory even after INT8 quantization across vision, audio, and wearable sensing. We present HYPER-TINYPW, a compression-as-generation approach that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and adds only a one-off synthesis cost; steady-state latency and energy match INT8 separable CNN baselines. Enforcing a shared latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We contribute (i) TinyML-faithful packed-byte accounting covering generator, heads/factorization, codes, kept PW1, and backbone; (ii) a unified evaluation with validation-tuned t* and bootstrap confidence intervals; and (iii) a deployability analysis covering integer-only inference and boot versus lazy synthesis. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPER-TINYPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a roughly 1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, indicating a wider role for compression-as-generation in resource-constrained ML systems. Beyond ECG, HYPER-TINYPW transfers to TinyML audio: on Speech Commands it reaches 96.2% test accuracy (98.2% best validation), supporting broader applicability to embedded sensing workloads where repeated linear mixers dominate memory.

Keywords: TinyML, model compression, generative compression, on-device inference, ECG analysis, microcontroller deployment, INT8 quantization, embedded sensing
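The compression figures quoted in the abstract above can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not the authors' byte accounting: the helper names are hypothetical, and the 1400 kB baseline is an assumed reading of "roughly 1.4 MB" with 1 MB taken as 1000 kB.

```python
def fraction_saved(ratio: float) -> float:
    """Fraction of bytes removed when a model becomes `ratio` times smaller."""
    return 1.0 - 1.0 / ratio


def compressed_size_kb(baseline_kb: float, ratio: float) -> float:
    """Size after compression, given the baseline size and compression ratio."""
    return baseline_kb / ratio


if __name__ == "__main__":
    # 6.31x is the ratio reported in the abstract.
    print(f"{fraction_saved(6.31):.2%}")
    # ASSUMPTION: "roughly 1.4 MB" read as 1400 kB.
    print(f"{compressed_size_kb(1400.0, 6.31):.0f} kB")
```

A 6.31x reduction corresponds to about 84.15% fewer bytes, and the assumed 1400 kB baseline lands near 222 kB, in line with the "about 225 kB" stated in the abstract.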

271. ❌ Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization

Authors: Kalle Kujanpää, Yuying Zhu, Kristina Klinkner, Shervin Malmasi Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24883v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

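Scoring rationale: The paper's core is applying LLMs to warehouse staffing optimization, and it explicitly uses supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), so it is highly relevant to 'Large Language Models', 'Post-training/SFT', and 'RLHF/DPO' (10 points each). It does not cover the technical details or application scenarios of the other keywords, which therefore score 0.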

!!! tip deepseek-chat TL;DR

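The study explores two approaches, offline reinforcement learning and fine-tuned large language models, for optimizing staffing decisions in semi-automated warehouse sortation systems. Offline RL achieves a 2.4% throughput improvement over historical baselines, while the fine-tuned LLMs match or slightly exceed them.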

摘要翻译

本研究探讨了优化半自动化仓库分拣系统实时人力配置决策的机器学习方法。运营决策支持可在不同抽象层级上实现，并涉及不同的权衡。我们在匹配的仿真环境中评估了两种方法。首先，我们基于详细的历史状态表示，采用离线强化学习训练定制的基于Transformer架构的策略，在学习的模拟器中实现了相比历史基线2.4%的吞吐量提升。在高吞吐量的仓库运营中，此等规模的提升意味着可观的成本节约。其次，我们探索了基于抽象化、人类可读状态描述进行操作的大语言模型。这类模型天然适用于仓库管理者依据高层运营摘要进行决策的场景。我们系统比较了提示工程技术、自动提示优化以及微调策略。尽管仅使用提示被证明效果不足，但结合监督微调与在模拟器生成的偏好数据上进行直接偏好优化后，其在手工构建的模拟器中达到了匹配甚至略微超越历史基线的性能。我们的研究结果表明，两种方法均为实现人工智能辅助运营决策提供了可行路径。离线强化学习在特定任务架构上表现卓越，而大语言模型支持人类可读的输入，并能与可融入管理者偏好的迭代反馈循环相结合。

Abstract

We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions. These are a natural fit for decisions that warehouse managers make using high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences achieves performance that matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making. Offline RL excels with task-specific architectures. LLMs support human-readable inputs and can be combined with an iterative feedback loop that can incorporate manager preferences.

Keywords: warehouse staffing optimization, offline reinforcement learning, large language models, supervised fine-tuning, direct preference optimization, simulation environment, operational decision-making, throughput improvement
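The abstract above mentions Direct Preference Optimization on simulator-generated preferences. As a reminder of what that objective looks like, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' training code: the function names, the log-probability inputs, and the default `beta` are all illustrative assumptions.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one preference pair."""
    # How much more the policy favours each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals softplus(-m).
    return _softplus(-margin)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for one (chosen, rejected) pair.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))  # policy equals reference
    print(dpo_loss(-4.0, -9.0, -5.0, -7.0))  # policy prefers the chosen response more
```

When the policy matches the reference the loss is log 2, and it falls as the policy shifts probability toward the preferred response relative to the reference.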

272. ❌ Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration

Authors: Lukas Kratochvila, Jakub Stefansky, Simon Bilik, Robert Rous, Tomas Zemcik, Michal Wolny, Frantisek Rusnak, Ondrej Cech, Karel Horak Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24850v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

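Scoring rationale: The paper studies a computer-vision system for automatic smoke detector inspection, using object detection models such as YOLOv11, SSD, and RT-DETRv2, together with data augmentation and semi-synthetic training strategies. The scoring keywords all concern large language models, innovations in deep learning techniques, or AI applications in science, whereas this paper addresses a conventional object detection task and touches none of them, so every keyword receives a relevance score of 0.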

!!! tip deepseek-chat TL;DR

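The paper studies automatic smoke detector recognition in industrial facilities. Comparing object detectors such as YOLOv11, SSD, and RT-DETRv2 under different training strategies, it finds that YOLOv11n performs best on the test datasets (average mAP@0.5 of 0.884), laying groundwork for future drone-integrated inspection systems.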

摘要翻译

消防安全是一个复杂的系统性工程，也是备受关注的重要议题。其中，烟雾探测器作为前端环节，其作用是在大规模火灾发生前发出警报。由于探测器常安装于高处或难以触及的位置，自动检测系统具有显著优势：它能加快检查速度，避免工作人员从事高空危险作业，并降低整体成本。本研究提出了自动检测系统中的烟雾探测器识别模块，该模块可便捷集成至无人机系统。作为研究的一部分，我们比较了两种广泛应用于嵌入式设备的卷积神经网络目标检测器YOLOv11和SSD，以及基于不同规模骨干网络的先进Transformer架构检测器RT-DETRv2。鉴于在真实环境中收集足量训练数据存在困难，我们还对比了多种训练策略，包括使用真实数据与半合成数据结合不同增强方法。为确保测试的鲁棒性，所有模型均在两个测试数据集上进行评估，其中烟雾探测器呈现预期外观及困难场景（包括运动模糊、低分辨率或目标不完整）。性能最佳的检测器为YOLOv11n，其平均mAP@0.5得分达到0.884。我们的代码、预训练模型及数据集均已公开。

Abstract

Fire safety consists of a complex pipeline, and it is a very important topic of concern. One of its front-line parts is the smoke detector, which is supposed to raise an alarm before a massive fire develops. As smoke detectors are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial, as it could allow faster inspections, keep workers from dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolutional object detectors widely used on embedded devices, YOLOv11 and SSD, with the state-of-the-art transformer-based RT-DETRv2 using backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is difficult, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. To achieve robust testing, all models were evaluated on two test datasets covering both expected and difficult appearances of the smoke detectors, including motion blur, low resolution, or incomplete objects. The best performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.

Keywords: smoke detector recognition, automatic inspection system, object detection, YOLOv11, RT-DETRv2, semi-synthetic data, drone integration, industrial facilities
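The mAP@0.5 metric reported above rests on an intersection-over-union (IoU) test between predicted and ground-truth boxes. The sketch below shows only that matching step for a single pair of boxes; it does not compute full mAP and is not the authors' evaluation code. The function names and the `(x1, y1, x2, y2)` box format are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a hit at the given IoU threshold (0.5 for mAP@0.5)."""
    return box_iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Hypothetical boxes, chosen only for illustration.
    print(box_iou((0, 0, 4, 4), (0, 0, 4, 3)))   # large overlap
    print(is_match((0, 0, 2, 2), (1, 0, 3, 2)))  # half-shifted box
```

In a full evaluation this test is applied to every confidence-ranked prediction, and average precision is then computed from the resulting precision-recall curve.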

273. ❌ Flow matching on homogeneous spaces

Authors: Francesco Ruscelli Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

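Scoring rationale: The paper sits at the intersection of mathematics and machine learning and proposes a general framework for extending Flow Matching to homogeneous spaces (quotients of Lie groups). Its core contribution is to lift the data distributions to the Lie group, recasting the problem as a flow matching task on the Lie group and then reducing it to a Euclidean flow matching task on the Lie algebra. The paper involves differential geometry, Lie group theory, generative models (flow matching), and computational efficiency, but it does not touch large language models (LLMs), innovations in deep learning techniques, AI for Science applications, or any specific technique named in the scoring keywords (e.g., MoE, RLHF, RAG, quantization). The keywords all concern large models, deep learning techniques, or specific AI application areas, whereas this paper is a purely mathematical machine learning methods study, so every keyword receives a relevance score of 0.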

!!! tip deepseek-chat TL;DR

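The paper proposes a general framework for extending flow matching to homogeneous spaces (quotients of Lie groups). By recasting the problem as flow matching on the Lie group and further reducing it to Euclidean flow matching on the Lie algebra, it avoids complicated geometric computations and yields a simpler, faster, and fully intrinsic method.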

摘要翻译

我们提出一个将流匹配（Flow Matching）推广至齐性空间（即李群商空间）的通用框架。该方法通过提升数据分布，将问题重新表述为底层李群上的流匹配任务。这一策略通过直接在李群上操作，避免了齐性空间可能复杂的几何结构，进而使我们能够将问题简化为李代数上的欧几里得流匹配任务。与黎曼流匹配（Riemannian Flow Matching）相比，我们的方法无需定义和计算预度量（premetrics）或测地线（geodesics），从而构建了一个更简洁、更快速且完全内蕴的框架。

Abstract

We propose a general framework to extend Flow Matching to homogeneous spaces, i.e. quotients of Lie groups. Our approach reformulates the problem as a flow matching task on the underlying Lie group by lifting the data distributions. This strategy avoids the potentially complicated geometry of homogeneous spaces by working directly on Lie groups, which in turn enables us to reduce the problem to a Euclidean flow matching task on Lie algebras. In contrast to Riemannian Flow Matching, our method eliminates the need to define and compute premetrics or geodesics, resulting in a simpler, faster, and fully intrinsic framework.

Keywords: Flow Matching, Homogeneous Spaces, Lie Groups, Lie Algebras, Riemannian Flow Matching, Generative Models, Geometric Deep Learning, Manifold Learning

274. ❌ NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Authors: Katarina Trojachanec Dineva, Stefan Andonov, Ilinka Ivanoska, Ivan Kitanovski, Sasho Gramatikov, Tamara Kostova, Monika Simjanoska Misheva, Kostadin Mishev Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

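Scoring rationale: The paper's core is evaluating vision-enabled large language models (V-LLMs) for clinical reasoning on neuroimaging, an application of large models in the biomedical domain. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points each). The paper uses few-shot prompting, which is partially related to 'In-context Learning OR Many-shot Learning' (5 points). It does not cover the specific techniques, training methods, optimizations, or application paradigms described by the other keywords (e.g., MoE, SFT, RLHF, RAG, CoT, agents, quantization), which therefore score 0.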

!!! tip deepseek-chat TL;DR

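By building the NeuroVLM-Bench benchmark, the study systematically evaluates the performance, reliability, and efficiency of 20 frontier vision-enabled large language models on clinical reasoning tasks such as diagnosis and subtyping for multiple neurological disorders (e.g., multiple sclerosis, stroke, brain tumors) from neuroimaging (MRI/CT). Diagnostic reasoning, especially subtype prediction, remains challenging. Gemini-2.5-Pro and GPT-5-Chat achieve the best diagnostic performance, Gemini-2.5-Flash offers the best efficiency-performance trade-off, and MedGemma-1.5-4B is the most promising open-weight model.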

摘要翻译

多模态大语言模型的最新进展为基于图像的决策支持开辟了新的可能性。然而，其在神经影像领域的可靠性与实际应用权衡仍未得到充分理解。本研究针对支持视觉的大语言模型在二维神经影像中的应用，开展了一项全面的基准测试。我们使用精心构建的涵盖多发性硬化、卒中、脑肿瘤、其他异常及正常对照的MRI与CT数据集。模型需同时生成多项输出，包括诊断、诊断亚型、成像模态、专用序列和解剖平面。性能评估涵盖四个维度：带弃权选项的判别性分类、校准度、结构化输出有效性以及计算效率。我们采用多阶段框架以确保公平比较，同时控制选择偏倚。在对二十个前沿多模态模型的测试中，结果显示，诸如模态与平面等技术性影像属性已近乎解决，而诊断推理，尤其是亚型预测，仍然具有挑战性。肿瘤分类是最可靠的任务，卒中分类中等可解，而多发性硬化与罕见异常则依然困难。少样本提示能提升部分模型的性能，但会增加令牌使用量、延迟和成本。Gemini-2.5-Pro和GPT-5-Chat取得了最强的整体诊断性能，而Gemini-2.5-Flash则提供了最佳的效率-性能平衡。在开源权重架构中，MedGemma-1.5-4B展现出最有前景的结果，在少样本提示下，其诊断性能接近多个专有模型的零样本水平，同时保持了完美的结构化输出。这些发现为理解多模态大语言模型在神经影像中的性能、可靠性和效率权衡提供了实用见解，有助于推动该领域的标准化评估。

Abstract

Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.

Keywords: Vision-Enabled Large Language Models, Clinical Reasoning, Neurological Disorders, Neuroimaging Benchmark, Multimodal LLM Evaluation, MRI and CT, Diagnostic Performance, Computational Efficiency

275. ❌ A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

Authors: Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

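Scoring rationale: The paper evaluates interpretability methods for clinical predictive models and is unrelated to most of the large-model keywords (LLM, MoE, SFT, RLHF, and so on). The strongest match is 'Mechanistic Interpretability OR Explainable AI' (10/10), because the paper's core contribution is assessing how reliable interpretability methods such as attention, KernelSHAP, and LIME are on clinical tasks. 'AI for Science OR Bioinformatics OR Cheminformatics' (8/10) is a partial match: the paper is a biomedical AI application, but its focus is interpretability rather than scientific AI in general. No other keyword applies.

!!! tip deepseek-chat TL;DR

By systematically evaluating interpretability methods on clinical prediction tasks, the study finds that attention, when used properly, explains model predictions faithfully and efficiently, while some black-box explainers are computationally infeasible or unreliable for time-series clinical tasks.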

Abstract

Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.
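The abstract reports that black-box explainers such as KernelSHAP are computationally infeasible for time-series tasks but does not give the cost analysis. One likely contributor is sketched below under stated assumptions: exact Shapley values need a model evaluation for every coalition of features, so the count doubles with each added feature. Treating each (time step, feature) cell as one player is an assumption made here for illustration, and `exact_shapley` is a hypothetical helper, not part of PyHealth or SHAP.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_players):
    """Brute-force Shapley values; returns them with the number of coalitions evaluated."""
    cache = {}

    def value(coalition):
        key = frozenset(coalition)
        if key not in cache:
            cache[key] = value_fn(key)
        return cache[key]

    shapley = [0.0] * n_players
    for i in range(n_players):
        others = [p for p in range(n_players) if p != i]
        for size in range(n_players):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n_players - size - 1) / factorial(n_players)
            for subset in combinations(others, size):
                gain = value(subset + (i,)) - value(subset)
                shapley[i] += weight * gain
    return shapley, len(cache)


# Toy additive "model": each cell contributes a fixed amount when present.
cell_effects = [1.0, 2.0, 3.0]
values, n_evaluations = exact_shapley(lambda cells: sum(cell_effects[c] for c in cells), 3)
```

With three cells the helper evaluates 2**3 = 8 coalitions. A record with 48 time steps and 20 features per step would have 960 cells and 2**960 coalitions, which is why practical explainers sample coalitions instead and why even the sampled budget can become large.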

Keywords: clinical predictive models, interpretability, attention mechanism, time-series data, benchmark evaluation, reproducibility, explainable AI, PyHealth framework

276. ❌ GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Authors: Deen Dayal Mohan, Hossein Souri, Vitali Petsiuk, Juhong Min, Gopal Sharma, Luowei Zhou, Suren Kumar Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

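Scoring rationale: The GoldiCLIP paper proposes a new pretraining method for vision-language models (VLMs): a framework that balances supervision signals through text-conditioned self-distillation, an encoder-integrated decoder with a VQA objective, and uncertainty-based loss weighting. It directly matches 'Pre-training OR Continual Pre-training OR Domain Adaptation' (weight 1.0) because it is a new vision-language pretraining (VLP) method aimed at data efficiency and quality, so it scores 10/10. The other keywords target language-only models (LLMs, MoE, SFT, RLHF), reasoning techniques (CoT, MCTS), agent systems, compression and acceleration, or specific scientific domains (such as bioinformatics); this work is multimodal vision-language pretraining and touches none of them, so they score 0. The weighted total is computed from the single relevant keyword: 10 × 1.0 = 10.0.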
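The arithmetic in the rationale above can be checked with a short sketch. It assumes the per-keyword score is weight × relevance, as the rationale states, and uses the report's passing-score formula (5.0 + 0.8 × total weight); the function names are illustrative, not the report generator's actual code.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def weighted_total(relevance, weights):
    """Sum of weight * relevance over every scored keyword."""
    return sum(weights[keyword] * score for keyword, score in relevance.items())


threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0
total = weighted_total({"Pre-training": 10.0}, {"Pre-training": 1.0})
passes = total >= threshold
```

With one keyword at relevance 10 and weight 1.0, the total of 10.0 stays below the threshold of about 26.6, so the paper does not pass.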

!!! tip deepseek-chat TL;DR

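The paper proposes GoldiCLIP, a framework that balances supervision signals (text-conditioned self-distillation, an encoder-integrated decoder with a VQA objective, and uncertainty-based weighting) to improve data efficiency in vision-language pretraining. Trained on only 30 million images, 300x less data than leading methods, it reaches state-of-the-art results among data-efficient approaches on several retrieval tasks.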

Abstract

Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder integrated decoder with Visual Question Answering (VQA) objective that enables the encoder to generalize beyond the caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.
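The abstract mentions an uncertainty-based weighting mechanism that balances heterogeneous losses but does not give its form. As a stand-in, the sketch below uses the widely cited homoscedastic-uncertainty formulation, in which each loss gets a learnable log-variance `s` and contributes `0.5 * exp(-s) * loss + 0.5 * s`. Whether GoldiCLIP uses this exact form is an assumption, and the function names are illustrative.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses, down-weighting each by its learned log-variance."""
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s for loss, s in zip(losses, log_vars))


def log_var_gradients(losses, log_vars):
    """Gradient of the combined loss with respect to each log-variance."""
    return [0.5 * (1.0 - math.exp(-s) * loss) for loss, s in zip(losses, log_vars)]


# Two hypothetical losses on very different scales, starting from log-variance 0.
combined = uncertainty_weighted_loss([4.0, 0.25], [0.0, 0.0])
grads = log_var_gradients([4.0, 0.25], [0.0, 0.0])
```

The gradient vanishes when `s = log(loss)`, so a loss with a large magnitude ends up with a small effective weight `exp(-s)`, and the `0.5 * s` term keeps the weights from collapsing to zero.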

Keywords: vision-language models, pretraining, data-efficient, self-distillation, visual question answering, uncertainty weighting, retrieval, GoldiCLIP

277. ❌ Local learning for stable backpropagation-free neural network training towards physical learning

Authors: Yaqi Guo, Fabian Braun, Bastiaan Ketelaar, Stephanie Tan, Richard Norte, Siddhant Kumar Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

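Scoring rationale: The paper presents FFzero, a forward-only learning framework for physical neural networks; it focuses on local learning as an alternative to backpropagation and is demonstrated on a simulated photonic neural network. The scoring keywords all target specific areas such as large language models, alignment, reasoning, agents, and scientific AI applications, whereas this paper's core contribution is a general neural network training paradigm (forward-only optimization, local learning). It involves no LLM-specific technique, scientific-domain application, or method named in the keywords, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes FFzero, a forward-only learning framework that trains neural networks stably without backpropagation by combining layer-wise local learning with forward-only evaluation, and demonstrates its feasibility on a simulated photonic neural network.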

Abstract

While backpropagation and automatic differentiation have driven deep learning’s success, the physical limits of chip manufacturing and rising environmental costs of deep learning motivate alternative learning paradigms such as physical neural networks. However, most existing physical neural networks still rely on digital computing for training, largely because backpropagation and automatic differentiation are difficult to realize in physical systems. We introduce FFzero, a forward-only learning framework enabling stable neural network training without backpropagation or automatic differentiation. FFzero combines layer-wise local learning, prototype-based representations, and directional-derivative-based optimization through forward evaluations only. We show that local learning is effective under forward-only optimization, where backpropagation fails. FFzero generalizes to multilayer perceptron and convolutional neural networks across classification and regression. Using a simulated photonic neural network as an example, we demonstrate that FFzero provides a viable path toward backpropagation-free in-situ physical learning.
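The abstract describes directional-derivative-based optimization that uses forward evaluations only. The sketch below shows one generic way to do this, not FFzero itself: probe the loss along a random unit direction, estimate the slope by a finite difference, and step against it. The finite-difference estimate, the step size, and the toy quadratic loss are all assumptions made for illustration.

```python
import math
import random


def directional_derivative_step(loss_fn, weights, rng, lr=0.25, eps=1e-4):
    """One update from two forward evaluations; no gradients are backpropagated."""
    direction = [rng.gauss(0.0, 1.0) for _ in weights]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    direction = [d / norm for d in direction]
    base = loss_fn(weights)
    probe = loss_fn([w + eps * d for w, d in zip(weights, direction)])
    slope = (probe - base) / eps  # estimated derivative of the loss along `direction`
    return [w - lr * slope * d for w, d in zip(weights, direction)]


def quadratic_loss(weights, target=(1.0, -2.0, 0.5)):
    return sum((w - t) ** 2 for w, t in zip(weights, target))


def train(steps=200, seed=0):
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        weights = directional_derivative_step(quadratic_loss, weights, rng)
    return weights
```

Each step removes the error component along the sampled direction, so on this toy problem the loss shrinks steadily until it reaches the floor set by the finite-difference step `eps`.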

Keywords: forward-only learning, local learning, backpropagation-free, physical neural networks, in-situ training, photonic neural networks, layer-wise optimization, directional-derivative-based optimization

278. ❌ Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

Authors: Abu Noman Md Sakib, Merjulah Roby, Zijie Zhang, Satish Muluk, Mark K. Eskandari, Ender A. Finol Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24801v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

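Scoring rationale: The paper analyzes model failures in medical image segmentation, specifically abdominal aortic aneurysm segmentation, and proposes an encoder shaping framework guided by explainable AI (XAI). Because its core is the application of XAI to medical image analysis, it is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10/10). As a biomedical AI application it is also related to 'AI for Science OR Bioinformatics OR Cheminformatics' (8/10), although that is not its focus. All other keywords concern large language models and related techniques (training, alignment, reasoning, agents, and so on); this work studies segmentation models in computer vision and involves neither LLMs nor natural language processing, so they all score 0.

!!! tip deepseek-chat TL;DR

To address model failures in CT segmentation of abdominal aortic aneurysms, the paper proposes an XAI-guided encoder shaping framework that optimizes encoder focus and reports substantially more reliable segmentation in complex cases.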

Abstract

Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map (“XAI field”) from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

Keywords: Abdominal Aortic Aneurysm, Image Segmentation, Explainable AI, XAI, Encoder Focus, Model Failure Analysis, Medical Imaging, Computed Tomography

279. ❌ Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

Authors: Jungtaek Kim, Thomas Zeng, Ziqian Lin, Minjae Lee, Chungpa Lee, Jy-yong Sohn, Hyung Il Koo, Kangwook Lee Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

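Scoring rationale: The paper's central question is whether LLMs can approximate search algorithms, which bears directly on how LLMs work, so 'Large Language Models OR LLMs OR Foundation Models' scores 10/10. It also fine-tunes an LLM on search trajectories to unlock the full capabilities of a pretrained LLM, which is partly related to 'Post-training OR Supervised Fine-tuning OR SFT' (5/10). The remaining keywords, such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and AI for Science, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper asks whether large language models (LLMs) can approximate search algorithms to navigate unknown tree-structured search spaces. It shows that Transformers are theoretically expressive enough to implement distinct search strategies, and that task-focused fine-tuning unlocks the full search capability of a pretrained LLM.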

Abstract

Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, the search algorithm can navigate such a search space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: Can LLMs or their underlying Transformer architectures approximate a search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models exhibit the possibility of generalizing to unseen conditions such as longer horizons or deeper trees. Furthermore, we demonstrate that continued task-focused training unlocks the complete capabilities of a pretrained LLM, by fine-tuning the LLM on search trajectories.
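The abstract refers to search strategies that balance exploitation and exploration under bandit feedback without naming a specific rule. As a familiar reference point, the sketch below implements UCB1, the selection rule behind UCT-style tree search, for choosing which child of a node to expand next. It is not the paper's construction; the function names and the deterministic toy rewards are assumptions.

```python
import math


def ucb1_select(counts, totals, step, c=math.sqrt(2)):
    """Pick the arm with the highest mean reward plus exploration bonus."""
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm  # try every arm once before comparing bonuses
    scores = [
        totals[arm] / counts[arm] + c * math.sqrt(math.log(step) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, rounds):
    """Play `rounds` steps of UCB1 and return how often each arm was chosen."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for step in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, step)
        counts[arm] += 1
        totals[arm] += reward_fn(arm)
    return counts


# Toy feedback: three children with fixed rewards; the second one is best.
pull_counts = run_bandit(lambda arm: [0.2, 0.8, 0.5][arm], n_arms=3, rounds=2000)
```

The bonus term shrinks as an arm is pulled more often, so weaker children are still revisited occasionally while most of the budget goes to the best one.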

Keywords: Large Language Models, Transformers, search algorithm, tree search, bandit feedback, fine-tuning, generalization, problem solving

280. ❌ Binary Expansion Group Intersection Network

Authors: Sicheng Zhou, Kai Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24763v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

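Scoring rationale: The paper develops a statistical theory and graphical representation of conditional independence, focused on binary data and multivariate statistics. It does not involve large models, deep learning, AI applications, or any technique named in the scoring keywords; it is purely statistical theory and methodology.

!!! tip deepseek-chat TL;DR

The paper introduces BEGIN, a distribution-free graphical representation for multivariate binary data and bit-encoded multinomial variables. It proves that conditional independence is equivalent to a sparse linear representation, a block factorization, and block diagonality, giving an analogue of Gaussian graphical modeling beyond the Gaussian setting.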

Abstract

Conditional independence is central to modern statistics, but beyond special parametric families it rarely admits an exact covariance characterization. We introduce the binary expansion group intersection network (BEGIN), a distribution-free graphical representation for multivariate binary data and bit-encoded multinomial variables. For arbitrary binary random vectors and bit representations of multinomial variables, we prove that conditional independence is equivalent to a sparse linear representation of conditional expectations, to a block factorization of the corresponding interaction covariance matrix, and to block diagonality of an associated generalized Schur complement. The resulting graph is indexed by the intersection of multiplicative groups of binary interactions, yielding an analogue of Gaussian graphical modeling beyond the Gaussian setting. This viewpoint treats data bits as atoms and local BEGIN molecules as building blocks for large Markov random fields. We also show how dyadic bit representations allow BEGIN to approximate conditional independence for general random vectors under mild regularity conditions. A key technical device is the Hadamard prism, a linear map that links interaction covariances to group structure.
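The abstract presents BEGIN as an analogue of Gaussian graphical modeling, where conditional independence shows up as a zero in a Schur complement of the covariance matrix. The sketch below illustrates only that Gaussian reference case, not BEGIN's generalized Schur complement on interaction covariances. The helper name and the toy covariance matrix are assumptions, and the helper conditions on a single variable for simplicity.

```python
def schur_complement(cov, keep, given):
    """Conditional covariance of the `keep` variables given the single variable `given`."""
    pivot = cov[given][given]
    return [
        [cov[i][j] - cov[i][given] * cov[given][j] / pivot for j in keep]
        for i in keep
    ]


# X3 ~ N(0, 1); X1 = X3 + noise; X2 = X3 + independent noise (unit variances).
covariance = [
    [2.0, 1.0, 1.0],
    [1.0, 2.0, 1.0],
    [1.0, 1.0, 1.0],
]
conditional = schur_complement(covariance, keep=[0, 1], given=2)
```

X1 and X2 are marginally correlated (covariance 1.0), yet the off-diagonal entry of the Schur complement is 0, which in the Gaussian case means they are conditionally independent given X3.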

Keywords: conditional independence, binary expansion, graphical representation, multivariate binary data, interaction covariance, Hadamard prism, Markov random fields, statistical theory

281. ❌ Synthetic Cardiac MRI Image Generation using Deep Generative Models

Authors: Ishan Kumarasinghe, Dasuni Kawya, Madhura Edirisooriya, Isuri Devindi, Isuru Nawinne, Vajira Thambawita Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24764v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

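Scoring rationale: The paper covers synthetic cardiac MRI generation with deep generative models (GANs, VAEs, diffusion models, and flow matching), an AI application in medical imaging. Nearly all keywords concern large language models (LLMs), which the paper does not address, so every keyword except 'AI for Science OR Bioinformatics OR Cheminformatics' scores 0. That keyword scores 5/10: the paper applies AI to a biomedical domain (cardiac MRI) but does not explicitly discuss bioinformatics or cheminformatics, so the relevance is moderate.

!!! tip deepseek-chat TL;DR

This review compares methods for generating synthetic cardiac MRI with deep generative models (such as GANs, VAEs, and diffusion models) to address the scarcity of medical data, and assesses them on fidelity, utility, and privacy in support of clinical workflows.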

Abstract

Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

Keywords: Synthetic cardiac MRI, Deep generative models, GANs, VAEs, Diffusion models, Privacy preservation, Medical imaging, Segmentation performance

282. ❌ Light Cones For Vision: Simple Causal Priors For Visual Hierarchy

Authors: Manglam Kartik, Neel Tushar Shah Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

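Scoring rationale: The paper compares Lorentzian and Euclidean geometry in vision models for hierarchical object discovery. This is basic research on computer vision model architecture and is unrelated to all of the scoring keywords, which focus on large language model techniques, training methods, inference optimization, and applications.

!!! tip deepseek-chat TL;DR

The paper argues that hierarchical object discovery in vision requires geometric structure that encodes asymmetric causality. It proposes Worldline Slot Attention, built on Lorentzian light cones, which outperforms Euclidean geometry and hyperbolic embeddings with only 11K parameters.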

Abstract

Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime worldlines, where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings showing visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: https://github.com/iclrsubmissiongram/loco.
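To make the "asymmetric causality" point concrete, the sketch below contrasts a symmetric Euclidean distance with a light-cone test in Minkowski spacetime. It is an illustration only, not the authors' implementation: the signature convention (-, +, +), the unit speed of light, the slot coordinates, and the choice of which slot sits at the earlier time are all assumptions.

```python
import math


def minkowski_interval(a, b) -> float:
    """Squared interval between events a=(t, x, ...) and b, signature (-, +, +, ...)."""
    dt = b[0] - a[0]
    spatial_sq = sum((bi - ai) ** 2 for ai, bi in zip(a[1:], b[1:]))
    return -dt * dt + spatial_sq


def in_future_light_cone(a, b) -> bool:
    """True if b is in the causal future of a: later in time, timelike or null separation."""
    return b[0] > a[0] and minkowski_interval(a, b) <= 0.0


# Hypothetical slots: same spatial position, different temporal coordinate.
whole = (0.0, 1.0, 2.0)
part = (1.0, 1.0, 2.0)

# Euclidean distance cannot tell the two directions apart.
print(math.dist(whole, part) == math.dist(part, whole))

# The light-cone relation is directional.
print(in_future_light_cone(whole, part), in_future_light_cone(part, whole))
```

The Euclidean comparison is symmetric by construction, while the light-cone test holds in one direction only, which is the kind of ordering a hierarchy needs.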

Keywords: Worldline Slot Attention, visual hierarchy, Lorentzian geometry, causal structure, hierarchical object discovery, light cones, geometric structure, spacetime worldlines

283. ❌ Compiling molecular ultrastructure into neural dynamics

Authors: Konrad P. Kording, Anton Arkhipov, Davy Deng, Sean Escola, Seth G. N. Grant, Gal Haspel, Michał Januszewski, Narayanan Kasthuri, Nina Khera, Richie E. Kohman, Grace Lindsay, Jeantine Lunshof, Adam Marblestone, David A. Markowitz, Jordan Matelsky, Brett Mensh, Patrick Mineault, Andrew Payne, Joanne Peng, Xaq Pitkow, Philip Shiu, Gregor Schuhknecht, Sven Truckenbrodt, Joshua T. Vogelstein, Edward S. Boyden Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

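Scoring rationale: The paper is a neuroscience contribution that proposes a machine-learning approach for compiling molecular ultrastructure data into neural dynamics parameters. It is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to a biological science (neuroscience), but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes the concept of an "ultrastructure-to-dynamics compiler": a machine-learning model that takes molecular ultrastructure data from high-resolution brain imaging and predicts uncertainty-aware local physiological parameters that can drive biophysical simulations, turning neuroanatomical maps into predictive models of circuit dynamics.

Abstract translation

High-resolution brain imaging can now capture synapse locations together with their molecular composition, and the cost of producing such maps is falling exponentially. So far, however, this ultrastructural data has revealed little about local neuronal physiology, in particular the key parameters that govern neural dynamics (such as synaptic efficacies and local conductances). We propose to translate molecularly annotated ultrastructure into physiological parameters and introduce the concept of an "ultrastructure-to-dynamics compiler": a learned mapping from molecularly annotated ultrastructure to uncertainty-aware, simulator-ready physiological parameters. The approach requires paired training data, combining ultrastructure obtained through imaging with dynamical responses to perturbations obtained through physiological experiments. With such data, we can train models that predict local physiological parameters directly from structure. Such a compiler would turn anatomical maps into models of circuit dynamics, supporting biophysical simulations, shifting structure-to-function research from a descriptive to a predictive paradigm, and opening new routes to understanding neural computation and forecasting the effects of interventions.

Abstract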

High-resolution brain imaging can now capture not just synapse locations but their molecular composition, with the cost of such mapping falling exponentially. Yet such ultrastructural data has so far told us little about local neuronal physiology - specifically, the parameters (e.g., synaptic efficacies, local conductances) that govern neural dynamics. We propose to translate molecularly annotated ultrastructure into physiology, introducing the concept of an ultrastructure-to-dynamics compiler: a learned mapping from molecularly annotated ultrastructure to simulator-ready, uncertainty-aware physiological parameters. The requirement is paired training data, with jointly acquired ultrastructure from imaging, and dynamical responses to perturbations from physiological experiments. With this data we can train models that predict local physiology directly from structure. Such a compiler would support biophysical simulations by turning anatomical maps into models of circuit dynamics, shifting structure-to-function from a descriptive program to a predictive one and opening routes to understanding neural computation and forecasting intervention effects.

Keywords: ultrastructure-to-dynamics compiler, molecular ultrastructure, neural dynamics, biophysical simulations, synaptic efficacies, local physiology prediction, brain imaging, circuit dynamics

284. ❌ A Bayesian Gamma-power-mixture survival regression model: predicting the recurrence of prostate cancer post-prostatectomy

Authors: Tommy Walker Mackay, Mingtong Xu, Shahrokh F. Shariat, Roger Sewell Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

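Scoring rationale: The paper uses a Bayesian Gamma-power-mixture survival regression model to predict prostate cancer recurrence after surgery and belongs to biomedical statistics. It involves no large models, deep learning, or AI techniques, so all the technical keywords (LLM, MoE, RLHF, and so on) are unrelated. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper is a bioinformatics application, but it uses a traditional statistical model rather than AI methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study uses a Bayesian Gamma-power-mixture survival regression model to predict time to biochemical recurrence after radical prostatectomy. It finds that age and pre-operative blood biomarkers (PSA, TGFbeta1, and others) alone extract the most information, and that adding other clinical variables does not increase it.

Abstract translation

In a dataset of 423 patients who underwent radical prostatectomy for localised prostate cancer, we used a Bayesian Gamma-power-mixture survival regression model to estimate the apparent Shannon information (ASI) about time to biochemical recurrence in various subsets of the variables available before surgery. In all subsets examined, the ASI was positive with posterior probability greater than 0.975.
Using only age and pre-operative blood test results (prostate-specific antigen, PSA, and biomarkers), we obtained an ASI of 0.232 (0.180 to 0.290) nats, equivalent to 0.335 (0.260 to 0.419) bits; these values are the posterior mean and equitailed 95% posterior confidence interval. This is more than double the posterior mean ASI previously obtained on the same dataset by a subset of the current authors using a log-skew-Student-mixture model, and it exceeds that earlier value with posterior probability greater than 0.99. Additionally including pre- or post-operative Gleason grades, operative findings, clinical stage, or the presence or absence of extraprostatic extension or seminal vesicle invasion did not increase the ASI extracted. However, removing the blood-based biomarkers and replacing them with either pre-operative Gleason grades or magnetic resonance imaging (MRI) findings greatly reduced the available ASI, to 0.077 (0.038 to 0.120) nats and 0.088 (0.045 to 0.132) nats respectively (both lower than the value obtained with blood-based biomarkers, with posterior probability greater than 0.995). A greedy selection of the best biomarkers gave, in descending order of importance among those examined: TGFβ1, VCAM1, IL6sR, and uPA.

Abstract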

In a dataset of 423 patients who had had radical prostatectomy for localised prostate cancer we estimated the apparent Shannon information (ASI) about time to biochemical recurrence in various subsets of the available pre-op variables using a Bayesian Gamma-power-mixture survival regression model. In all the subsets examined the ASI was positive with posterior probability greater than 0.975. Using only age and results of pre-operative blood tests (PSA and biomarkers) we achieved 0.232 (0.180 to 0.290) nats ASI (0.335 (0.260 to 0.419) bits) (posterior mean and equitailed 95% posterior confidence intervals). This is more than double the mean posterior ASI previously achieved on the same dataset by a subset of the current authors using a log-skew-Student-mixture model, and is greater than that previous value with posterior probability greater than 0.99. Additionally using pre- or post-operative Gleason grades, operative findings, clinical stage, and presence or absence of extraprostatic extension or seminal vesicle invasion did not increase the ASI extracted. However removing the blood-based biomarkers and replacing them with either pre-operative Gleason grades or findings available from MRI scanning greatly reduced the available ASI to respectively 0.077 (0.038 to 0.120) and 0.088 (0.045 to 0.132) nats (both less than the values using blood-based biomarkers with posterior probability greater than 0.995). A greedy approach to selection of the best biomarkers gave TGFbeta1, VCAM1, IL6sR, and uPA in descending order of importance from those examined.
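The abstract reports the ASI in both nats and bits. The two units differ only by a factor of ln 2 (1 nat = 1/ln 2 bits), and the short sketch below applies that conversion to the three reported values; the function name is hypothetical.

```python
import math


def nats_to_bits(nats: float) -> float:
    """Convert an information quantity from nats to bits (divide by ln 2)."""
    return nats / math.log(2)


# Posterior mean and interval bounds as reported in the abstract, in nats.
reported_nats = {"mean": 0.232, "lower": 0.180, "upper": 0.290}

for label, nats in reported_nats.items():
    print(f"{label}: {nats:.3f} nats = {nats_to_bits(nats):.3f} bits")
```

The mean and lower bound reproduce the reported 0.335 and 0.260 bits. The upper bound comes out at 0.418 rather than the reported 0.419, which is consistent with the abstract converting unrounded values.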

Keywords: Bayesian survival regression, prostate cancer recurrence, biochemical recurrence, Shannon information, biomarkers, Gamma-power-mixture model, radical prostatectomy, predictive modeling

285. ❌ Learning relationships in epidemiological data using graph neural networks

Authors: Anthony J Wood, Aeron R Sanchez, Rowland R Kao Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24745v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

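Scoring rationale: The paper uses graph neural networks (GNNs) to analyse epidemiological data and infer disease transmission relationships, which is an application of AI to science. It does not involve large language models (LLMs), innovations in deep learning fundamentals, or any of the specific techniques in the keyword list (MoE, SFT, RAG, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI (GNNs) to bioinformatics and epidemiology, but this is neither a core innovation nor a direct match, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes using graph neural networks (GNNs) to model host relationships and genetic distances in epidemiological data in order to predict disease transmission pathways; compared with traditional methods, GNNs perform better but cost more to compute.

Abstract translation

When designing control strategies for an infectious disease, it is critical to identify the key pathways of transmission. Data on infected hosts, including when they were born, where they lived, and with whom they had contact, can help infer sources of infection and transmission clusters. However, such data are usually insufficient to determine infector-infectee pairs accurately.
Whole-genome sequencing data of the pathogen, on the other hand, can be a powerful complement to these data, because they can be used to estimate the time to the most recent common ancestor between two infected hosts and, in turn, their relative proximity in the transmission tree. A statistical model that explains the genetic distance between the pathogens of different hosts and the associated risk factors can therefore help identify the key risk factors for transmission itself.
We show how graph neural networks (GNNs) are a powerful and natural modelling architecture for this kind of problem. By treating the epidemiological dataset as a graph in which infected hosts are nodes and edges are weighted by the genetic distance between host pairs, we explain how a GNN can be fitted to predict the genetic distance between known hosts and new, unsequenced hosts. Comparisons with other established methods show that GNNs offer useful performance advantages, albeit at a higher computational cost.

Abstract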

When designing control strategies for an infectious disease it is critical to identify the key pathways of transmission. Data on infected hosts - when they were born, where they lived and with whom they interacted - can help infer sources of infection and transmission clusters. However such data are generally not powerful enough to identify infector-infectee pairs with any certainty. Whole-genome sequencing data of the underlying pathogen, on the other hand, can serve as a powerful adjoint to these data as they can be used to estimate a time to a most recent common ancestor between two infected hosts, and in turn their relative proximity in the transmission tree. A statistical model that explains the genetic distance between different host pathogens and associated risk factors can therefore inform key risk factors for transmission itself. We show how graph neural networks (GNNs) are a powerful and natural modelling architecture for such a problem. By treating the epidemiological dataset as a graph where infected hosts are nodes and edges are weighted by the genetic distance between different host pairs, we show how a GNN can be fit to predict the genetic distance between known hosts and new, unsequenced hosts. Comparisons with other established approaches show that GNNs have useful performance advantages albeit with greater computational cost.
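The graph construction described in the abstract can be pictured with a toy example. The sketch below is an illustration only and does not reproduce the authors' model: the host names, the single node feature, and the distances are hypothetical, and the update shown is one plain distance-weighted neighbourhood aggregation, the kind of step that GNN layers build on, with no learned weights.

```python
# Hypothetical infected hosts, each with one scalar feature (e.g. a scaled location).
features = {"A": 0.0, "B": 1.0, "C": 4.0}

# Hypothetical pairwise genetic distances, stored as undirected weighted edges.
distances = {("A", "B"): 2.0, ("B", "C"): 6.0, ("A", "C"): 8.0}


def neighbours(node):
    """Yield (neighbour, genetic distance) pairs for the given host."""
    for (u, v), d in distances.items():
        if u == node:
            yield v, d
        elif v == node:
            yield u, d


def aggregate(node):
    """One aggregation step: inverse-distance-weighted mean of neighbour features."""
    weighted = [(features[n], 1.0 / d) for n, d in neighbours(node)]
    total_weight = sum(w for _, w in weighted)
    return sum(f * w for f, w in weighted) / total_weight


updated = {node: aggregate(node) for node in features}
for node, value in updated.items():
    print(node, round(value, 3))
```

Genetically closer hosts contribute more to each update, which is one simple way to let the edge weights shape the information that flows between nodes.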

Keywords: graph neural networks, epidemiological data, infectious disease transmission, genetic distance, transmission tree, host-pathogen relationships, computational modeling, risk factors

286. ❌ Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells

Authors: Han Zhang, Guo-Hua Yuan, Chaohao Yuan, Tingyang Xu, Tian Bian, Hong Cheng, Wenbing Huang, Deli Zhao, Yu Rong Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25240v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

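Scoring rationale: The paper proposes Lingshu-Cell, a generative cellular world model for single-cell transcriptome modelling. It belongs to AI for Science (bioinformatics) and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper explicitly describes the model as a "cellular world model", which is highly relevant to "World Models AND General World Models" (10 points). It also mentions "foundation models for single-cell transcriptomics", an application of foundation models to science, which gives some relevance to "Large Language Models OR LLMs OR Foundation Models" (8 points). The remaining keywords concern large-model technical details (MoE, RLHF, quantization, and so on) or specific applications (agents, tool calling) that the paper does not address, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes Lingshu-Cell, a generative cellular world model based on masked discrete diffusion that models the distribution of single-cell transcriptomic states and predicts conditional responses under perturbation. It achieves leading performance on the Virtual Cell Challenge and in predicting cytokine responses in human PBMCs.

Abstract translation

Modelling cellular states and predicting their responses to perturbations are central challenges in computational biology and in the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states to support generative simulation. This paper introduces Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space, which is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection (for example, filtering by high variability or ranking by expression level). Across different tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns, and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. In addition, by jointly embedding cell type or donor identity with perturbation information, Lingshu-Cell can predict the whole-transcriptome expression changes produced by novel combinations of identity and perturbation. The model achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and performs strongly in predicting cytokine-induced responses in human peripheral blood mononuclear cells (PBMCs). Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.

Abstract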

Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.
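Masked discrete diffusion models are commonly trained by corrupting a token sequence with mask tokens and learning to recover the originals. The abstract does not specify Lingshu-Cell's tokenisation or noise schedule, so the sketch below shows only the generic corruption step under stated assumptions: the `MASK` id, the discretised expression bins, and the fixed masking rate are all hypothetical.

```python
import random

MASK = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_rate, rng):
    """Corruption step: independently replace each token with MASK with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else t for t in tokens]


rng = random.Random(0)  # seeded so the example is repeatable

# Hypothetical discretised expression levels for eight genes (mostly zero, i.e. sparse).
expression_bins = [0, 0, 3, 0, 1, 0, 0, 2]

corrupted = mask_tokens(expression_bins, mask_rate=0.5, rng=rng)
print(corrupted)
```

A denoising model would then be trained to predict the original tokens at the masked positions; because each gene is masked independently, the step does not depend on any ordering of the genes.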

Keywords: generative cellular world model, transcriptome modeling, virtual cells, masked discrete diffusion model, single-cell transcriptomics, perturbation prediction, biological discovery, in silico simulation

287. ❌ OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video

Authors: Selim Gilon, Emily Y. Miller, Scott D. Uhlrich Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

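Scoring rationale: The paper studies the estimation of 3D human kinematics and musculoskeletal dynamics from smartphone video, at the intersection of computer vision and biomechanics. It combines optimization, physics-based simulation, and machine learning, but every keyword targets large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, agents, and so on). The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the work applies AI to a biomedical and scientific problem; since the paper does not involve large-model techniques, it receives 5 points (some relevance). All other keywords relate directly to large-model techniques, whereas the paper does not address LLMs, deep learning fundamentals, or novel applications of large models in other domains, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper develops OpenCap Monocular, an algorithm that estimates 3D human kinematics and musculoskeletal dynamics from a single smartphone video, enabling low-cost, scalable biomechanical assessment; its accuracy is validated on walking, squatting, and sit-to-stand tasks.

Abstract translation

Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, for example estimating quadriceps force during a sit-to-stand movement, could transform the prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics has traditionally required costly, time-consuming analysis in specialised laboratories, which limits clinical translation. Scalable, accurate tools for biomechanical assessment are therefore needed. We present OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines, via optimization, the 3D human pose estimates produced by a monocular pose estimation model (WHAM), computes the kinematics of a biomechanically constrained skeletal model, and estimates kinetics through physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (mean absolute error of 4.8° for rotational degrees of freedom and 3.4 cm for pelvis translations), improving on a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). During walking, OpenCap Monocular estimated ground reaction forces with accuracy comparable to, or better than, our earlier two-camera OpenCap system. We show that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including the knee extension moment during sit-to-stand transitions and the knee adduction moment during walking. OpenCap Monocular is deployed through a smartphone app, a web app, and a secure cloud computing platform (https://opencap.ai), enabling free, accessible biomechanical assessment with a single smartphone.

Abstract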

Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.
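The headline kinematic figures are mean absolute errors against a reference measurement. As a reminder of what that metric computes, here is a minimal sketch; the joint-angle values are made up for illustration and are not taken from the paper, and the function name is hypothetical.

```python
def mean_absolute_error(estimated, reference):
    """Average of |estimate - reference| over paired samples."""
    if len(estimated) != len(reference):
        raise ValueError("estimated and reference must have the same length")
    if not estimated:
        raise ValueError("at least one sample is required")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


# Hypothetical knee flexion angles in degrees at four time points.
video_estimate = [10.0, 25.0, 40.0, 55.0]
mocap_reference = [12.0, 22.0, 45.0, 50.0]

mae = mean_absolute_error(video_estimate, mocap_reference)
print(f"MAE = {mae:.2f} degrees")
```

Because the errors enter as absolute values, over- and under-estimates do not cancel, so the metric reflects the typical size of the deviation rather than its direction.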

Keywords: 3D human kinematics, musculoskeletal dynamics, smartphone video, biomechanical assessment, monocular pose estimation, physics-based simulation, machine learning, OpenCap

288. ❌ Automating Computational Chemistry Workflows via OpenClaw and Domain-Specific Skills

Authors: Mingwei Ding, Chen Huang, Yibo Hu, Yifan Li, Zitian Lu, Xingtai Yu, Duo Zhang, Wenxi Zhai, Tong Zhu, Qiangqiang Gu, Jinzhe Zeng Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

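Scoring rationale: The paper focuses on automating computational chemistry workflows and proposes a decoupled agent-skill design based on OpenClaw for tasks such as molecular dynamics simulation. Its content is unrelated to the vast majority of the keywords (large-model fundamentals, training methods, inference optimization, alignment techniques, and so on), which mainly target the technical details and applications of large language models (LLMs). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to scientific computing (specifically computational chemistry) and is highly relevant to "AI for Science", although it does not address bioinformatics or cheminformatics specifically, so it receives 10 points (core content).

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of automating multistep computational chemistry workflows. Its decoupled agent-skill design built on OpenClaw achieves cross-tool execution, recovery from runtime failures, and reaction network extraction, and a molecular dynamics case study of methane oxidation illustrates the scalability and maintainability of the approach.

Abstract translation

Automating multistep computational chemistry tasks remains challenging because reasoning, workflow specification, software execution, and high-performance computing (HPC) execution are often tightly coupled. We present a decoupled agent-skill design based on OpenClaw for computational chemistry automation. Specifically, OpenClaw provides centralised control and supervision; schema-defined planning skills translate scientific goals into executable task specifications; domain skills encapsulate specific computational chemistry procedures; and DPDispatcher manages job execution across heterogeneous HPC environments. In a molecular dynamics (MD) case study of methane oxidation, the system completed cross-tool execution, bounded recovery from runtime failures, and reaction network extraction, demonstrating a scalable and maintainable approach to multistep computational chemistry automation.

Abstract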

Automating multistep computational chemistry tasks remains challenging because reasoning, workflow specification, software execution, and high-performance computing (HPC) execution are often tightly coupled. We demonstrate a decoupled agent-skill design for computational chemistry automation leveraging OpenClaw. Specifically, OpenClaw provides centralized control and supervision; schema-defined planning skills translate scientific goals into executable task specifications; domain skills encapsulate specific computational chemistry procedures; and DPDispatcher manages job execution across heterogeneous HPC environments. In a molecular dynamics (MD) case study of methane oxidation, the system completed cross-tool execution, bounded recovery from runtime failures, and reaction network extraction, illustrating a scalable and maintainable approach to multistep computational chemistry automation.
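The abstract mentions "bounded recovery from runtime failures" without saying how the bound is enforced. One common reading is a capped number of retries, sketched below in plain Python. This is an assumption for illustration, not the paper's mechanism, and it uses no OpenClaw or DPDispatcher APIs: `run_with_bounded_recovery`, `flaky_md_step`, and the output file name are all hypothetical.

```python
def run_with_bounded_recovery(task, max_attempts=3):
    """Call task(attempt) until it succeeds, giving up after max_attempts tries."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError as err:
            # A real system might adjust inputs or resources here before retrying.
            last_error = err
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


def flaky_md_step(attempt):
    """Stand-in for a simulation job that fails twice and then succeeds."""
    if attempt < 3:
        raise RuntimeError("simulated job failure")
    return "trajectory.dump"


result = run_with_bounded_recovery(flaky_md_step)
print(result)
```

Capping the attempts keeps a persistently failing job from consuming cluster time indefinitely, and chaining the final exception to the last failure preserves the original cause for whoever inspects it.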

Keywords: computational chemistry, workflow automation, OpenClaw, agent-skill design, molecular dynamics, HPC execution, task specification, reaction network extraction

289. ❌ Complementary Eigen-Zundel Interpretation Reconciles Thermodynamics and Spectroscopy of Excess Protons in Aqueous HF Solutions

Authors: Louis Lehmann, Florian N. Brünig, Jonathan Scherlitzki, Morten Lehmann, Martin Kaupp, Beate Paulus, Roland R. Netz Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25371v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究水溶液中过量质子的微观行为，属于计算化学和物理化学领域，使用从头算分子动力学方法。所有关键词均与大语言模型、深度学习技术、AI系统等计算机科学领域无关，因此除’AI for Science OR Bioinformatics OR Cheminformatics’外，其他关键词均得0分。‘AI for Science’关键词得5分，因为该研究属于科学计算应用，但论文未明确使用AI或机器学习方法，而是传统计算化学方法，因此相关性有限。

!!! tip deepseek-chat TL;DR

该论文通过从头算分子动力学模拟，揭示了HF水溶液中过量质子的动态共享机制，提出了修正的Eigen-Zundel模型，解释了HF和HCl溶液相似振动光谱的原因，统一了热力学和光谱学观测结果。

Abstract

Aqueous solutions of HF and HCl behave very differently at intermediate concentrations: HCl dissociates completely, whereas HF remains only partially dissociated and forms bifluoride (HF$_2^-$). This should lead to different excess-proton spectra in HF and HCl solutions, in contrast to experimental reports. Using ab initio molecular dynamics, we show that in HF the proton is not firmly bound to F$^-$, as suggested by textbook chemistry, but dynamically shared with a hydrating water molecule. This is rationalized by a modified Eigen-state description which also explains the formation of HF$_2^-$. The similar vibrational spectra of HF and HCl solutions are explained by a complementary Zundel picture in terms of almost identical excess proton transfer free-energy profiles for HF and HCl. These results reconcile thermodynamic and spectroscopic observations and provide a unified microscopic picture of excess protons in aqueous solution.

Keywords: ab initio molecular dynamics, excess protons, aqueous HF solutions, Eigen-Zundel model, vibrational spectra, proton transfer, thermodynamics, spectroscopy

290. ❌ Deep learning of committor and explainable artificial intelligence analysis for identifying reaction coordinates

Authors: Toshifumi Mori, Kei-ichi Okazaki, Kang Kim, Nobuyuki Matubayasi Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.25237v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
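
The score table above can be cross-checked with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) comes from the report's configuration. The per-row rule `score = weight × relevance` is an assumption inferred from the table, where every row is printed with weight 0.0 and therefore scores 0.0 regardless of relevance. The function names are hypothetical.

```python
# Sketch of the report's scoring arithmetic.
# The pass-mark formula is taken from the report configuration;
# the per-keyword rule (weight * relevance) is an ASSUMPTION inferred from the table.

def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark = 5.0 + 0.8 * total weight (from the report configuration)."""
    return base + factor * total_weight

def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)

# Rows of this paper's table: 25 keywords at relevance 0 and two at relevance 10,
# all with weight 0.0 as printed in the table.
rows = [(0.0, 0.0)] * 25 + [(0.0, 10.0)] * 2

threshold = pass_mark(total_weight=27.0)
score = total_score(rows)
verdict = "pass" if score >= threshold else "fail"
print(f"score {score:.1f} / {threshold:.1f} -> {verdict}")
```

Under this assumed rule, a relevance of 10 contributes nothing while the row weight is 0.0, which matches the 0.0 total shown for this paper.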

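Scoring rationale: The paper identifies reaction coordinates in molecular systems by using deep learning to predict the committor function, then applies explainable AI (XAI) to analyze the contribution of each input variable. It is unrelated to most keywords (LLM, MoE, SFT, RLHF, RAG, and so on), which concern large language model techniques, training methods, or inference optimization, whereas this paper applies conventional deep learning to molecular science. Two keywords are relevant: "Mechanistic Interpretability OR Explainable AI" (10), because the paper explicitly uses XAI techniques to analyze model predictions, and "AI for Science OR Bioinformatics OR Cheminformatics" (10), because the paper applies AI to science, specifically molecular systems. All other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper reviews a framework based on deep learning and explainable AI (XAI) for identifying reaction coordinates in complex molecular systems: a neural network is trained to predict the committor function, and the contributions of the collective variables are then analyzed to reveal the molecular mechanism.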

Abstract

In complex molecular systems, the reaction coordinate (RC) that characterizes transition pathways is essential to understand underlying molecular mechanisms. This review surveys a framework for identifying the RC by applying deep learning to the committor, which provides the most reliable measure of the progress along a transition path. The inputs to the neural network are collective variables (CVs) expressed as functions of atomic coordinates of the system, and the corresponding RC is predicted as the output by training the network on the committor as the learning target. Because deep learning models typically operate in a black-box manner, it is difficult to determine which input variables govern the predictions. The incorporation of eXplainable Artificial Intelligence (XAI) techniques enables quantitative assessment of the contributions of individual input variables to the predictions. This approach allows the identification of CVs that play dominant roles and demonstrates that the committor distribution on the surface using important CVs is separated by well-defined boundaries. The framework provides an explainable deep learning strategy for assigning a molecular mechanism from the RC and is applicable to a wide range of complex molecular systems.
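
To illustrate the committor that the reviewed framework uses as its learning target, the sketch below computes it for a toy system that is not from the paper: a symmetric random walk on sites 0..n, where site 0 plays the role of the reactant state A and site n the product state B. The committor q(i) is the probability of reaching B before A when starting from site i; for this toy model it satisfies q(i) = (q(i-1) + q(i+1)) / 2 with q(0) = 0 and q(n) = 1. The function name and the choice of system are assumptions made for illustration.

```python
# Toy committor for a symmetric 1D random walk (illustrative, not the paper's system).
# q[i] = probability of hitting site n (state B) before site 0 (state A) starting at i.

def committor_random_walk(n: int, sweeps: int = 5000) -> list[float]:
    """Solve q[i] = 0.5 * (q[i-1] + q[i+1]) with q[0] = 0, q[n] = 1 by Gauss-Seidel sweeps."""
    q = [0.0] * (n + 1)
    q[n] = 1.0  # boundary condition: the walker is already in B
    for _ in range(sweeps):
        for i in range(1, n):
            q[i] = 0.5 * (q[i - 1] + q[i + 1])
    return q

q = committor_random_walk(10)
for site, value in enumerate(q):
    print(f"site {site:2d}: committor = {value:.3f}")
```

In this toy model the committor is linear in the site index, so the index itself already measures progress from A to B. Molecular systems have no such closed form, which is the setting in which the abstract describes training a network on the committor.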

Keywords: reaction coordinate, committor, deep learning, explainable AI, collective variables, molecular systems, neural network, XAI

291. ❌ A sustainable photocatalytic pathway for concurrent hydrogen and value-added chemical production utilizing microalgae as bio-scavenger in water

Authors: Ho Truong Nam Hai, Augusto Ducati Luchessi, Kaveh Edalati Journal/Source: arxiv Published: 2026-03-26 arXiv link: http://arxiv.org/abs/2603.24924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

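Scoring rationale: The paper studies photocatalytic production of hydrogen and chemicals using microalgae as a bio-scavenger, and belongs to environmental science, chemical engineering, and renewable energy. It does not involve large models, deep learning, artificial intelligence, or any other computer science technique. Every scoring keyword concerns large-model technology and its applications, none of which relates to the paper's subject (photocatalysis, microalgae, hydrogen production), so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new strategy that combines microalgae as a bio-scavenger with a titanium dioxide photocatalyst, maximizing green hydrogen production while converting the microalgae into valuable products such as methane and carbon monoxide. Under optimal conditions, the hydrogen production rate is 13 times higher than without microalgae.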

Abstract

Microalgae are an abundant bioorganic material source and play a significant role in life on Earth by conducting photosynthesis for carbon dioxide (CO2) capture and its conversion to oxygen (O2). In this study, a combination of microalgae as a negative-CO2-emitting sacrificial agent with the traditional photocatalytic water-splitting process using brookite TiO2, as a model photocatalyst, is introduced as a new strategy to maximize green hydrogen (H2) production while converting microalgae to valuable products, like methane (CH4) and carbon monoxide (CO). The process, under optimal conditions, produces up to 0.990 mmol/g.h of H2 without cocatalyst addition and 3.200 mmol/g.h with platinum (Pt) cocatalyst, which is 13 times higher than the production rate without microalgae. The strategy of using microalgae in photocatalysis has high potential in green H2 production, as it not only eliminates valuable hole sacrificial agents, like alcohol, but also produces other useful compounds, like CH4 and CO. Moreover, this sustainable process contributes to CO2 capture and conversion during microalgae cultivation.

Keywords: photocatalytic, microalgae, hydrogen production, brookite TiO2, CO2 capture, value-added chemicals, sustainable process, sacrificial agent

292. ❌ Implementation of the multigrid Gaussian-Plane-Wave algorithm with GPU acceleration in PySCF

Authors: Rui Li, Xing Zhang, Qiming Sun, Yuanheng Wang, Junjie Yang, Garnet Kin-Lic Chan Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24881v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

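Scoring rationale: The paper belongs to computational chemistry. It develops a GPU-accelerated multigrid Gaussian-Plane-Wave density fitting algorithm for Fock builds and nuclear gradient evaluations in Kohn-Sham density functional theory. Its content is unrelated to the vast majority of keywords (large models, deep learning, training methods, inference optimization, agents, and so on), which belong to AI and machine learning, whereas the paper belongs to computational and quantum chemistry. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper is computational chemistry (adjacent to cheminformatics) and uses GPU-accelerated scientific computing, but it involves no AI models or algorithms, so the relevance is weak and it receives 5.

!!! tip deepseek-chat TL;DR

The paper develops a GPU-accelerated multigrid Gaussian-Plane-Wave density fitting algorithm for efficiently computing Fock matrices and nuclear gradients in Kohn-Sham density functional theory. It achieves up to a 25x speedup on an H100 GPU relative to a CPU implementation, providing an efficient tool for large-scale first-principles calculations on molecules and solids.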

Abstract

We introduce a GPU-accelerated multigrid Gaussian-Plane-Wave density fitting (FFTDF) approach for efficient Fock builds and nuclear gradient evaluations within Kohn-Sham density functional theory, as implemented in the GPU4PySCF module of PySCF. Our CUDA kernels employ a grid-based parallelization strategy for contracting Gaussian basis function pairs and achieve up to 80% of the FP64 peak performance on NVIDIA GPUs, with no loss of efficiency for high angular momentum (up to f-shell) functions. Benchmark calculations on molecules and solids with up to 1536 atoms and 20480 basis functions show up to 25x speedup on an H100 GPU relative to the CPU implementation on a 28-core shared memory node. For a 256-water cluster, the ground-state energy and nuclear gradients can be computed in ~30 seconds on a single H100 GPU. This implementation serves as an open-source foundation for many applications, such as ab initio molecular dynamics and high-throughput calculations.

Keywords: GPU acceleration, multigrid Gaussian-Plane-Wave, density functional theory, Fock build, nuclear gradient, PySCF, high-throughput calculations, ab initio molecular dynamics

293. ❌ Permeation of hydrogen across graphdiyne: molecular dynamics vs. quantum simulations and role of membrane motion

Authors: Mateo Rodríguez, José Campos-Martínez, Marta I. Hernández Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24827v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

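Scoring rationale: The paper studies the permeation of hydrogen molecules through a graphdiyne membrane by comparing quantum-mechanical calculations with molecular dynamics simulations, and belongs to computational chemistry and materials science. The keywords all concern large language models, deep learning principles, or AI applications, none of which the paper touches. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper falls under scientific computing; however, it uses conventional computational chemistry methods rather than AI, so it receives 5 (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

By comparing quantum-mechanical calculations with molecular dynamics simulations, the paper studies the permeation of hydrogen molecules through static and moving graphdiyne membranes. It finds that molecular dynamics reasonably reproduces the temperature dependence of the quantum permeances, and that accounting for the thermal motion of the membrane enhances the permeances.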

Abstract

Previous research based on electronic structure calculations and molecular dynamics (MD) simulations have demonstrated that graphdiyne (GDY) is a very suitable two-dimensional membrane for the separation of small molecules in a gas mixture of different species. However, quantum effects may play a role in the dynamics of these permeation processes when light molecules are the ones involved in the crossing of the GDY subnanometric pores. In this work we report rigorous quantum-mechanical calculations together with equivalent MD simulations of the transport of H2 molecules through a static GDY membrane, as a case study for the validity of the application to these problems of classical dynamics. The force fields employed are based on an improved Lennard-Jones formulation, with parameters optimized by means of accurate ab initio calculations. It is found that, although quantum effects are still significant at the temperatures of interest (between 250 and 350 K), MD simulations are able to reasonably reproduce the dependence of the quantum permeances with the temperature. Moreover, MD permeances computed with quantum corrections through Feynman-Hibbs effective potentials provide a lower bound to quantum permeances, while the pure classical counterpart gives an upper bound, thus leading to a well delimited range of confidence of the permeation results. Furthermore, within MD simulations it is possible to incorporate the thermal motion of the GDY layer and in this situation it is observed an enhancement of the permeances with respect to the fixed membrane case, due to a significant reduction of the permeation barriers when the GDY atoms are allowed to vibrate. It seems apparent therefore, that modeling the membrane motion is crucial to provide reliable simulations of the gas transport features.
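
For orientation, the quadratic Feynman-Hibbs effective potential adds a temperature-dependent correction to a classical pair potential: V_FH(r) = V(r) + (ħ² / (24 μ k_B T)) · (V''(r) + 2 V'(r) / r). The sketch below evaluates it for a standard 12-6 Lennard-Jones potential in reduced units. This is an assumption for illustration: the paper uses an improved Lennard-Jones formulation whose form and parameters are not given in the abstract, and the abstract does not say which order of the Feynman-Hibbs expansion is used. `prefactor` stands for ħ² / (24 μ k_B T) in units of σ², and its value here is arbitrary.

```python
# Quadratic Feynman-Hibbs correction to a 12-6 Lennard-Jones potential (reduced units).
# ASSUMPTIONS: standard LJ form and an arbitrary prefactor; not the paper's force field.

def lj(r: float, eps: float = 1.0, sigma: float = 1.0) -> float:
    """Classical Lennard-Jones potential V(r)."""
    s6 = (sigma / r) ** 6
    return 4.0 * eps * (s6 * s6 - s6)

def lj_d1(r: float, eps: float = 1.0, sigma: float = 1.0) -> float:
    """First derivative V'(r)."""
    s6 = (sigma / r) ** 6
    return 4.0 * eps * (-12.0 * s6 * s6 + 6.0 * s6) / r

def lj_d2(r: float, eps: float = 1.0, sigma: float = 1.0) -> float:
    """Second derivative V''(r)."""
    s6 = (sigma / r) ** 6
    return 4.0 * eps * (156.0 * s6 * s6 - 42.0 * s6) / (r * r)

def feynman_hibbs(r: float, prefactor: float) -> float:
    """V_FH(r) = V(r) + prefactor * (V''(r) + 2 V'(r) / r)."""
    return lj(r) + prefactor * (lj_d2(r) + 2.0 * lj_d1(r) / r)

r_min = 2.0 ** (1.0 / 6.0)  # location of the classical LJ minimum
classical = lj(r_min)
corrected = feynman_hibbs(r_min, prefactor=0.01)
print(f"V(r_min) = {classical:.4f}, V_FH(r_min) = {corrected:.4f}")
```

At the classical minimum the first derivative vanishes and the curvature is positive, so the correction raises the energy and the effective well is shallower than the classical one.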

Keywords: graphdiyne, hydrogen permeation, molecular dynamics, quantum simulations, membrane motion, permeation barriers, Feynman-Hibbs, gas transport

294. ❌ Concerted Electron-Ion Transport by Polyacrylonitrile Elucidated with Reactive Deep Learning Potentials

Authors: Rajni Chahal-Crockett, Michael D. Toomey, Logan T. Kearney, Yawei Gao, Joshua T. Damron, Amit K. Naskar, Santanu Roy Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

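Scoring rationale: The paper uses a deep-learning potential to study the charge transport mechanism of polyacrylonitrile (PAN), an application of AI to science, specifically chemistry and materials science. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" relates directly to the paper, because it applies deep learning to chemical reaction simulation and materials design. The other keywords concern large language model (LLM) techniques, training methods, inference optimization, agent systems, and so on, and have no direct connection to the molecular dynamics simulations and reaction mechanism analysis in the paper.

!!! tip deepseek-chat TL;DR

The study develops a deep-learning potential that reveals the kinetics of polyacrylonitrile (PAN) cyclization initiated by nucleophilic attack. It finds that the first cyclization is the rate-limiting step and that it triggers Li+-coupled electron transfer, which makes the subsequent ring formation about 10,000 times faster, offering a route to designing polymers with enhanced charge transport.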

Abstract

Charge transport in polymers, such as polyacrylonitrile (PAN), is crucial for electronics and energy storage. For instance, PAN can transport cations e.g., Li+, by facilitating dynamic cation-nitrile coordination in batteries. However, little is known regarding the underlying role of complex reactive polymer configurations. Herein, we develop a deep-learning potential, trained on ab initio energies and forces of nonequilibrium reactive PAN configurations, to unravel the kinetics of PAN cyclization initiated by a nucleophile (OH- dissociated from LiOH) attacking the terminal nitrile carbon. We find, based on the reaction free-energetics, rates, and charge analysis, that the nucleophile attack producing the first ring is the rate-limiting step, which subsequently triggers Li+-coupled electron transfer along the PAN backbone, causing ~10,000 times faster sequential ring-formation of the remaining nitriles. PAN’s extended configurations, where dipolar and H-bonding interactions are minimal, enable such rapid kinetics. By validating our computational findings with IR and NMR experiments, we establish a pathway for designing reactive polymers with enhanced charge transport for energy applications.

Keywords: deep-learning potential, polyacrylonitrile, charge transport, reaction kinetics, cyclization, electron-ion transport, reactive polymers, energy storage

295. ❌ Autotuning T-PaiNN: Enabling Data-Efficient GNN Interatomic Potential Development via Classical-to-Quantum Transfer Learning

Authors: Vivienne Pelletier, Vedant Bhat, Daniel J. Rivera, Steven A. Wilson, Christopher L. Muhich Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

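Scoring rationale: The paper develops graph neural network (GNN) interatomic potential models for molecular systems, an AI for Science application. Its core contribution is T-PaiNN, a transfer learning framework from classical force field data to quantum data that relies on pretraining and fine-tuning (which the authors call autotuning). It is therefore highly relevant (10) to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT", because the paper explicitly uses pretraining and fine-tuning. It is also highly relevant (10) to "AI for Science OR Bioinformatics OR Cheminformatics", because it applies directly to cheminformatics and computational chemistry. The remaining keywords concern large language model techniques (LLMs, MoE, RLHF, RAG, and so on), reasoning methods (CoT, System 2), agent systems, optimization techniques (quantization, attention mechanisms), or general AI topics; the paper covers none of them, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes T-PaiNN, a transfer learning framework that pretrains on inexpensive classical force field data and fine-tunes on a small quantum mechanical dataset. This substantially improves the data efficiency and accuracy of graph neural network interatomic potentials, with order-of-magnitude error reductions on molecular and liquid water systems.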

Abstract

Machine-learned interatomic potentials (MLIPs), particularly graph neural network (GNN)-based models, offer a promising route to achieving near-density functional theory (DFT) accuracy at significantly reduced computational cost. However, their practical deployment is often limited by the large volumes of expensive quantum mechanical training data required. In this work, we introduce a transfer learning framework, Transfer-PaiNN (T-PaiNN), that substantially improves the data efficiency of GNN-MLIPs by leveraging inexpensive classical force field data. The approach consists of pretraining a PaiNN MLIP architecture on large-scale datasets generated from classical molecular simulations, followed by fine-tuning (dubbed autotuning) using a comparatively small DFT dataset. We demonstrate the effectiveness of autotuning T-PaiNN on both gas-phase molecular systems (QM9 dataset) and condensed-phase liquid water. Across all cases, T-PaiNN significantly outperforms models trained solely on DFT data, achieving order-of-magnitude reductions in mean absolute error while accelerating training convergence. For example, using the QM9 data set, error reductions of up to 25 times are observed in low-data regimes, while liquid water simulations show improved predictions of energies, forces, and experimentally relevant properties such as density and diffusion. These gains arise from the model’s ability to learn general features of the potential energy surface from extensive classical sampling, which are subsequently refined to quantum accuracy. Overall, this work establishes transfer learning from classical force fields as a practical and computationally efficient strategy for developing high-accuracy, data-efficient GNN interatomic potentials, enabling broader application of MLIPs to complex chemical systems.
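
The pretrain-then-fine-tune idea can be illustrated with a deliberately small stand-in. The sketch below is not T-PaiNN and uses no GNN: it fits a one-dimensional linear model, with a plentiful "classical" dataset drawn from y = 2x + 1 and three "quantum" points drawn from y = 2.2x + 0.9. All data, names, and hyperparameters are assumptions chosen for illustration.

```python
# Toy illustration of transfer learning: pretrain on plentiful cheap data,
# then fine-tune on a few expensive points. NOT the T-PaiNN model; all values are made up.

def train(w, b, data, steps, lr=0.1):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mean_abs_error(w, b, data):
    """Mean absolute error of the linear model on (x, y) pairs."""
    return sum(abs(w * x + b - y) for x, y in data) / len(data)

def cheap(x):      # stand-in for the classical force field
    return 2.0 * x + 1.0

def expensive(x):  # stand-in for the quantum reference
    return 2.2 * x + 0.9

cheap_data = [(k / 10.0, cheap(k / 10.0)) for k in range(-10, 11)]      # 21 points
small_data = [(x, expensive(x)) for x in (-0.5, 0.0, 0.5)]              # 3 points
test_data = [(k / 10.0, expensive(k / 10.0)) for k in range(-10, 11)]

# Both models get the same small fine-tuning budget; only the starting point differs.
w0, b0 = train(0.0, 0.0, cheap_data, steps=500)   # pretraining on cheap data
w_t, b_t = train(w0, b0, small_data, steps=20)    # fine-tuning from the pretrained start
w_s, b_s = train(0.0, 0.0, small_data, steps=20)  # training from scratch

mae_transfer = mean_abs_error(w_t, b_t, test_data)
mae_scratch = mean_abs_error(w_s, b_s, test_data)
print(f"MAE with pretraining: {mae_transfer:.3f}, from scratch: {mae_scratch:.3f}")
```

With the same 20 fine-tuning steps, the pretrained start ends closer to the reference because it only has to correct the difference between the two functions rather than learn the whole function.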

Keywords: graph neural network, interatomic potentials, transfer learning, pretraining, fine-tuning, classical force fields, quantum mechanical data, data efficiency

296. ❌ How unconstrained machine-learning models learn physical symmetries

Authors: Michelangelo Domina, Joseph William Abbott, Paolo Pegolo, Filippo Bigi, Michele Ceriotti Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24638v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

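Scoring rationale: The paper studies how unconstrained machine-learning models learn physical symmetries, focusing on models for physical simulation, in particular transformer-based models applied to atomistic simulations and particle physics. It is unrelated to the vast majority of keywords (LLMs, MoE, RLHF, RAG, and so on), which target the technical details and applications of large language models. Two keywords are relevant: "Mechanistic Interpretability OR Explainable AI" (5), because analyzing how models learn symmetries and building a diagnostic framework falls within explainable AI, and "AI for Science OR Bioinformatics OR Cheminformatics" (10), because applying AI to science (physical simulation) is the core of the paper.

!!! tip deepseek-chat TL;DR

The paper studies how unconstrained machine-learning models learn physical symmetries. It introduces rigorous metrics for the symmetry content of learned representations, applies them to transformer-based models for atomistic simulations and particle physics, and builds a framework for diagnosing spectral failure modes in ML models. It then shows that strategically injecting the minimum required inductive biases yields better stability and accuracy.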

Abstract

The requirement of generating predictions that exactly fulfill the fundamental symmetry of the corresponding physical quantities has profoundly shaped the development of machine-learning models for physical simulations. In many cases, models are built using constrained mathematical forms that ensure that symmetries are enforced exactly. However, unconstrained models that do not obey rotational symmetries are often found to have competitive performance, and to be able to learn to a high level of accuracy an approximate equivariant behavior with a simple data augmentation strategy. In this paper, we introduce rigorous metrics to measure the symmetry content of the learned representations in such models, and assess the accuracy by which the outputs fulfill the equivariant condition. We apply these metrics to two unconstrained, transformer-based models operating on decorated point clouds (a graph neural network for atomistic simulations and a PointNet-style architecture for particle physics) to investigate how symmetry information is processed across architectural layers and is learned during training. Based on these insights, we establish a rigorous framework for diagnosing spectral failure modes in ML models. Enabled by this analysis, we demonstrate that one can achieve superior stability and accuracy by strategically injecting the minimum required inductive biases, preserving the high expressivity and scalability of unconstrained architectures while guaranteeing physical fidelity.
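
One generic way to quantify how well a vector-valued model respects rotational symmetry is to compare f(Rx) with R f(x) over sampled rotations R. The sketch below does this in two dimensions for two hand-written functions. It illustrates the equivariance condition only; it is not the metric defined in the paper, and the function names are hypothetical.

```python
import math

# Generic equivariance check in 2D: f is rotation-equivariant if f(R x) = R f(x).
# Illustrative only; this is not the metric defined in the paper.

def rotate(v, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def equivariant_f(v):
    """Scales the input by its squared norm, so it commutes with rotations."""
    n2 = v[0] ** 2 + v[1] ** 2
    return (n2 * v[0], n2 * v[1])

def non_equivariant_f(v):
    """Treats the two components differently, so it breaks rotational symmetry."""
    return (v[0] ** 2, v[1])

def equivariance_error(f, points, n_angles=36):
    """Mean of |f(R x) - R f(x)| over the sample points and evenly spaced angles."""
    total, count = 0.0, 0
    for p in points:
        for k in range(n_angles):
            angle = 2.0 * math.pi * k / n_angles
            a = f(rotate(p, angle))
            b = rotate(f(p), angle)
            total += math.hypot(a[0] - b[0], a[1] - b[1])
            count += 1
    return total / count

points = [(1.0, 0.0), (0.5, -1.5), (-2.0, 0.3)]
err_equivariant = equivariance_error(equivariant_f, points)
err_non_equivariant = equivariance_error(non_equivariant_f, points)
print(f"equivariant: {err_equivariant:.2e}, non-equivariant: {err_non_equivariant:.2e}")
```

The first function gives an error at the level of floating-point round-off, while the second gives a clearly nonzero error. A model that has only learned approximate equivariance would fall somewhere in between.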

Keywords: machine-learning models, physical symmetries, transformer-based models, equivariant behavior, atomistic simulations, particle physics, spectral failure modes, inductive biases
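
The abstract above mentions assessing how accurately a model's outputs fulfill the equivariant condition. As a rough illustration only, the sketch below computes a generic rotational equivariance error, the relative difference between f(Rx) and R f(x) averaged over random rotations. This is not the paper's metric: the function names (`random_rotation`, `equivariance_error`) and the two toy models are hypothetical and chosen purely for demonstration.

```python
import math
import random


def random_rotation(rng):
    """Build a 3x3 rotation matrix from a random unit quaternion."""
    # Normalising four Gaussian samples gives a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rotate(R, points):
    """Apply rotation matrix R to every 3D point in the list."""
    return [[sum(R[i][j] * p[j] for j in range(3)) for i in range(3)] for p in points]


def equivariance_error(model, points, n_rotations=32, seed=0):
    """Mean relative error ||f(Rx) - R f(x)|| / ||f(x)|| over random rotations."""
    rng = random.Random(seed)
    base = model(points)
    base_norm = math.sqrt(sum(c * c for v in base for c in v))
    total = 0.0
    for _ in range(n_rotations):
        R = random_rotation(rng)
        out_of_rotated = model(rotate(R, points))  # f(Rx)
        rotated_out = rotate(R, base)              # R f(x)
        diff = math.sqrt(sum(
            (a - b) ** 2
            for u, v in zip(out_of_rotated, rotated_out)
            for a, b in zip(u, v)
        ))
        total += diff / base_norm
    return total / n_rotations


def centered(points):
    """Toy equivariant model: displacement of each point from the centroid."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    return [[p[i] - c[i] for i in range(3)] for p in points]


def axis_scaled(points):
    """Toy non-equivariant model: stretches the z axis, which breaks rotational symmetry."""
    return [[p[0], p[1], 2.0 * p[2]] for p in points]


if __name__ == "__main__":
    rng = random.Random(42)
    cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
    print("equivariant toy model:    ", equivariance_error(centered, cloud))
    print("non-equivariant toy model:", equivariance_error(axis_scaled, cloud))
```

For the centroid-displacement model the error should sit at floating-point noise, while the axis-stretching model gives a clearly non-zero value. An unconstrained network trained with rotation augmentation would be expected to fall somewhere in between.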

Token usage statistics

Total: 871,537 tokens (input 557,240 / output 314,297)

Model	Input	Output	Total
deepseek-chat	529,924	292,417	822,341
glm-4.7	27,316	21,880	49,196

📋 All papers

1. ✅ Closing the Confidence-Faithfulness Gap in Large Language Models

2. ✅ AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer’s Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

3. ✅ FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

4. ✅ SEVerA: Verified Synthesis of Self-Evolving Agents

5. ✅ Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

6. ✅ An Experimental Comparison of the Most Popular Approaches to Fake News Detection

7. ✅ TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

8. ❌ Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

9. ❌ Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

10. ❌ Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

11. ❌ SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning

12. ❌ Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

13. ❌ Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

14. ❌ Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

15. ❌ GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

16. ❌ CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

17. ❌ Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

18. ❌ Learning domain-invariant features through channel-level sparsification for Out-Of Distribution Generalization

19. ❌ To Write or to Automate Linguistic Prompts, That Is the Question

20. ❌ Enabling ab initio geometry optimization of strongly correlated systems with transferable deep quantum Monte Carlo

21. ❌ Back to Basics: Revisiting ASR in the Age of Voice Agents

22. ❌ Insights on back marking for the automated identification of animals

23. ❌ Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

24. ❌ A Distribution-to-Distribution Neural Probabilistic Forecasting Framework for Dynamical Systems

25. ❌ Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

26. ❌ Vega: Learning to Drive with Natural Language Instructions

27. ❌ PixelSmile: Toward Fine-Grained Facial Expression Editing

28. ❌ PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

29. ❌ Natural-Language Agent Harnesses

30. ❌ R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

31. ❌ Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

32. ❌ Neural Network Conversion of Machine Learning Pipelines

33. ❌ Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

34. ❌ The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

35. ❌ A Unified Memory Perspective for Probabilistic Trustworthy AI

36. ❌ Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

37. ❌ Measuring What Matters – or What’s Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

38. ❌ A Mentalistic Interface for Probing Folk-Psychological Attribution to Non-Humanoid Robots

39. ❌ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

40. ❌ Visual or Textual: Effects of Explanation Format and Personal Characteristics on the Perception of Explanations in an Educational Recommender System

41. ❌ Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

42. ❌ DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

43. ❌ Are LLMs Overkill for Databases?: A Study on the Finiteness of SQL

44. ❌ TAAC: A gate into Trustable Audio Affective Computing

45. ❌ Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

46. ❌ Voxtral TTS

47. ❌ CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

48. ❌ Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

49. ❌ NERO-Net: A Neuroevolutionary Approach for the Design of Adversarially Robust CNNs

50. ❌ Lightweight GenAI for Network Traffic Synthesis: Fidelity, Augmentation, and Classification

51. ❌ EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents

52. ❌ Retraining as Approximate Bayesian Inference

53. ❌ Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models

54. ❌ Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning

55. ❌ Temporally Decoupled Diffusion Planning for Autonomous Driving

56. ❌ Cross-Model Disagreement as a Label-Free Correctness Signal

57. ❌ From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

58. ❌ Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

59. ❌ Decidable By Construction: Design-Time Verification for Trustworthy AI