📊 ArXiv Research Report (2026-03-16)

Generated: 2026-03-16 17:24:00 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
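
The passing-score formula can be checked with a short sketch. The function names `pass_threshold` and `paper_score` are hypothetical, and the aggregation rule (sum of weight × relevance over all keywords) is an assumption inferred from the score tables in this report, not a documented part of the pipeline.

```python
import math


def pass_threshold(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing score as stated in the report: base + slope * total weight."""
    return base + slope * total_weight


def paper_score(relevances, weights=None):
    """Assumed aggregation: sum of weight * relevance over all keywords."""
    if weights is None:
        weights = [1.0] * len(relevances)
    return sum(w * r for w, r in zip(weights, relevances))


# 27 keywords, each with weight 1.0.
threshold = pass_threshold(total_weight=27 * 1.0)

# Non-zero relevance values from the ESPIRE score table; the other 17 keywords score 0.
espire = [8, 5, 5, 5, 8, 8, 10, 5, 8, 5] + [0] * 17
score = paper_score(espire)
passed = score >= threshold

print(f"threshold={threshold:.1f} score={score:.1f} passed={passed}")
# prints: threshold=26.6 score=67.0 passed=True
```

With 27 keywords of weight 1.0 the threshold works out to 5.0 + 0.8 × 27.0 = 26.6, matching the value reported above.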

📈 Paper Statistics

Total fetched: 267 papers
Passing papers: 9 (3.4%)
In-depth analyses: 2

⭐ Detailed Analysis of Passing Papers

1. ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13033v1

Score: 67.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	8.0/10	8.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

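Scoring rationale: The paper proposes the ESPIRE benchmark, which evaluates vision-language models (VLMs) on embodied spatial reasoning tasks. The most relevant keywords are “LLM Agents” (10 points: the paper directly involves embodied agents executing tasks in a simulated world), “World Models” (8 points: the paper builds a simulated world that physically grounds VLMs), and “Chain of Thought” and “System 2 Thinking” (8 points each: the paper emphasizes decomposing spatial reasoning and the step from reasoning to acting). “Large Language Models” also scores 8 points because VLMs are a kind of large model. “Pre-training”, “SFT”, “Instruction Tuning”, “Explainable AI”, and “In-context Learning” score 5 points each: the paper touches on model adaptation and diagnostic analysis, but these are not central. The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ESPIRE, a diagnostic benchmark for evaluating vision-language models on embodied spatial reasoning. A simulated world and a decomposition of each task into localization and execution enable fine-grained analysis of the models' spatial reasoning behavior.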

Abstract

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

Keywords: Vision-Language Models, Embodied Spatial Reasoning, Diagnostic Benchmark, Simulated World, Robotic Tasks, Localization and Execution, Generative Problems, Spatial Cognition

2. NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

Authors: Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen, Zhenyu Wang, Tianhao Liu, Siqi Chen, Weilin Huang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12378v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	5.0/10	5.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

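Scoring rationale: NeuroLoRA's core contribution is a Mixture-of-Experts (MoE) based LoRA framework for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). It is therefore highly relevant to “Large Language Models (LLMs)”, “Mixture of Experts (MoE)”, and “PEFT/LoRA” (10 points each). The paper explicitly covers multi-task adaptation and sequential continual learning, which gives it some relevance to “Domain Adaptation” and “Model Merging” (5 points each). The paper does not directly discuss the other keywords, such as SLMs, Scaling Laws, RAG, inference acceleration, or science-specific AI applications, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes NeuroLoRA, a Mixture-of-Experts (MoE) based parameter-efficient fine-tuning (PEFT) framework inspired by biological neuromodulation. With a context-aware gating mechanism and a Contrastive Orthogonality Loss, it outperforms existing baselines in single-task adaptation, multi-task model merging, and continual learning while maintaining comparable parameter efficiency.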

Abstract

Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation – the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.

Keywords: Parameter-Efficient Fine-Tuning (PEFT), Low-Rank Adaptation (LoRA), Mixture-of-Experts (MoE), Large Language Models (LLMs), Multi-Task Adaptation, Continual Learning, Context-Aware Neuromodulation, Contrastive Orthogonality Loss
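
The gating idea in the abstract (a learnable gate that rescales a frozen random projection before expert selection) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's implementation: the sigmoid gate, the magnitude-based top-k selection, and the names `neuromodulated_topk`, `proj`, and `gate_w` are all hypothetical, since the abstract does not give the exact formulation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def neuromodulated_topk(x, proj, gate_w, k):
    """Rescale a frozen random projection with an input-dependent gate, then keep the top-k slots."""
    z = matvec(proj, x)                          # frozen random projection (never trained)
    g = [sigmoid(v) for v in matvec(gate_w, x)]  # assumed gate form: sigmoid of a linear map
    scaled = [gi * zi for gi, zi in zip(g, z)]   # contextual rescaling before selection
    order = sorted(range(len(scaled)), key=lambda i: abs(scaled[i]), reverse=True)
    active = sorted(order[:k])                   # assumed routing rule: largest magnitudes win
    sparse = [scaled[i] if i in active else 0.0 for i in range(len(scaled))]
    return active, sparse


rng = random.Random(0)
d_in, rank, k = 8, 6, 2
proj = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
gate_w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

active, sparse = neuromodulated_topk(x, proj, gate_w, k)
print(active, sparse)
```

Because the gate depends on `x`, two inputs with the same projection magnitudes can activate different slots, which is the contrast with static, magnitude-only routing that the abstract draws.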

3. SteerRM: Debiasing Reward Models via Sparse Autoencoders

Authors: Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12795v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

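Scoring rationale: The paper's core topic is debiasing reward models (RMs), which belongs to LLM alignment and RLHF. It is highly relevant to “Instruction Tuning/Alignment/Value Alignment” and “RLHF/RLAIF/DPO” (10 points each) because reward models are a key component of RLHF alignment pipelines. It is relevant to “Large Language Models/LLMs/Foundation Models” (8 points) because reward models are usually built on LLMs, and to “Mechanistic Interpretability/Explainable AI” (8 points) because it uses sparse autoencoders (SAEs) to interpret and intervene on features. It has some relevance to “Hallucination Mitigation/Factuality/Truthfulness” (5 points): debiasing shifts the model's preference toward semantics rather than surface style, which indirectly supports factuality. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Context Window, have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SteerRM, a training-free method that uses sparse autoencoder (SAE) interventions to reduce reward models' bias toward superficial stylistic cues, improving the accuracy and interpretability of reward models in alignment pipelines without retraining.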

Abstract

Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.

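Keywords: Reward Models, Debiasing, Sparse Autoencoders, Alignment, RLHF, Interpretability, Bias Mitigation, SAE Interventions

In-depth analysis:

SteerRM: Debiasing Reward Models via Sparse Autoencoders

Summary:

Reward models (RMs) are central to RLHF, yet they often prefer surface formatting (such as Markdown) over semantic quality. Existing debiasing methods require retraining or architectural changes, which is costly. The paper proposes SteerRM, the first training-free debiasing method built on sparse autoencoders (SAEs). It synthesizes format-controlled response pairs, identifies bias-related SAE features with a strength-stability criterion, and suppresses those features at inference time. In experiments, SteerRM raises RM-Bench Hard-split accuracy by 7.3 percentage points on average across six LLaMA-based RMs, and it generalizes to a Gemma-based model and to other stylistic biases. The study also finds that format-related features are concentrated in shallow layers and transfer across models, offering a practical and interpretable solution for alignment pipelines.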

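Contributions:

Proposes SteerRM, the first training-free method for debiasing reward models with sparse autoencoders (SAEs), avoiding the performance degradation that direct activation suppression incurs because of representation entanglement.
Designs a feature-identification mechanism based on a strength-stability criterion that targets the SAE features associated with format bias.
Finds that format-related SAE features are concentrated in the shallow layers of the Transformer and transfer across models, revealing shared architecture-level bias-encoding patterns.
Shows that the method applies beyond Markdown bias to other stylistic confounders, without modifying model parameters or training objectives.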

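Method

!!! info

The paper follows a three-stage pipeline. First, it synthesizes format-controlled pairs: responses with identical semantic content but different surface formatting (for example, Markdown versus plain text). Second, it uses a pretrained SAE to decompose the reward model's hidden states, computes paired differences, and applies strength and stability criteria to select format-related SAE features. Third, at inference time it suppresses the activations of those features, mitigating the reward model's format bias without updating any model parameters.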
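
The second and third stages can be illustrated with a toy sketch. It is a minimal example under stated assumptions rather than the paper's algorithm: the exact strength-stability criterion is not given in this report, so strength is assumed to be the mean paired activation difference and stability the share of pairs with a positive difference. The function names, thresholds, and toy activations are hypothetical.

```python
def select_bias_features(formatted_acts, plain_acts, min_strength, min_stability):
    """Pick SAE feature indices whose paired activation gap is both large and consistent."""
    n_pairs = len(formatted_acts)
    n_features = len(formatted_acts[0])
    selected = []
    for j in range(n_features):
        diffs = [formatted_acts[i][j] - plain_acts[i][j] for i in range(n_pairs)]
        strength = sum(diffs) / n_pairs                       # assumed: mean paired difference
        stability = sum(1 for d in diffs if d > 0) / n_pairs  # assumed: share of positive gaps
        if strength >= min_strength and stability >= min_stability:
            selected.append(j)
    return selected


def suppress(feature_acts, bias_features):
    """Zero the selected SAE features before decoding back to the hidden state."""
    blocked = set(bias_features)
    return [0.0 if j in blocked else a for j, a in enumerate(feature_acts)]


# Toy SAE activations: rows are response pairs, columns are SAE features.
formatted = [[0.9, 0.2, 0.5], [1.1, 0.1, 0.4], [0.8, 0.3, 0.6]]
plain = [[0.1, 0.2, 0.5], [0.2, 0.3, 0.3], [0.0, 0.1, 0.7]]

bias = select_bias_features(formatted, plain, min_strength=0.5, min_stability=0.9)
print(bias)                             # [0]
print(suppress([0.7, 0.2, 0.5], bias))  # [0.0, 0.2, 0.5]
```

Only feature 0 fires consistently more strongly on the formatted responses, so it alone is suppressed. The other features pass through unchanged, which is why a feature-level intervention can leave semantic judgments largely intact.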

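Key results:

On the RM-Bench Hard split, SteerRM raises the accuracy of six LLaMA-based reward models by 7.3 percentage points on average while preserving overall performance.
Its effectiveness is confirmed on a Gemma-based reward model, and a controlled study of a non-format bias indicates that it generalizes.
Analysis shows that format-related SAE features lie mainly in the model's shallow layers and transfer across models.
SAE interventions are shown to mitigate reward-model bias without retraining.

Tech stack: Sparse Autoencoders (SAE), reward models, Reinforcement Learning from Human Feedback (RLHF), activation intervention, strength-stability scoring, Transformer architecture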

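Strengths

Training-free and efficient: no retraining or architectural changes are needed, which lowers compute cost and implementation complexity.
Interpretable: the SAE decomposes internal representations into interpretable features, making the source of the bias explicit.
Performance-preserving: debiasing retains the model's ability to judge semantic content and avoids performance degradation.
Generalizes: the method applies not only to one model family (such as LLaMA) but also to Gemma and to other types of stylistic bias.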

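Limitations

Dependence on pretrained SAEs: effectiveness depends on the quality and availability of a pretrained SAE dictionary; if the SAE reconstructs features imperfectly, debiasing may be limited.
Feature-identification accuracy: selection based on the strength-stability criterion may miss complex bias features and risks misidentifying features.
Inference overhead: although training-free, the method runs SAE encoding and decoding at inference time, which may add latency and compute.
Bias types covered: the method mainly targets format bias; generalization is claimed, but its handling of deeper semantic biases needs further validation.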

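Relevance to research focus:

The paper is highly relevant. It directly addresses core LLM technology: reward models (RMs) and alignment (RLHF). It applies sparse autoencoders (SAEs) to the problem of entangled internal representations, an innovation in deep learning methodology. Although its main application is general NLP, its approach to reducing bias and improving evaluation accuracy is a useful reference for AI applications in science, such as scientific literature assessment and research assistance. The technical novelty is strong and meets the criteria for a high score.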

4. Test-Time Attention Purification for Backdoored Large Vision Language Models

Authors: Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12989v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

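Scoring rationale: The paper studies the vulnerability of large vision-language models (LVLMs) to backdoor attacks during fine-tuning and proposes CleanSight, a defense that requires no retraining and operates only at test time. Four keywords are highly relevant. 1) “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points): LVLMs are large models, although the focus is on vision-language models rather than pure language models. 2) “Post-training” OR “Supervised Fine-tuning” OR “SFT” (8 points): the paper states that the backdoor is implanted during fine-tuning, a key post-training stage. 3) “PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning” (10 points): the paper names LoRA modules among the backdoored parameters and notes that existing defenses retrain them, which is core technical background. 4) “Mechanistic Interpretability” OR “Explainable AI” (10 points): the central contribution is a mechanistic understanding of backdoor behavior in LVLMs, explaining backdoor activation through cross-modal attention redistribution (attention stealing), which falls squarely within mechanistic interpretability. The other keywords are unrelated: the paper concentrates on backdoor defense, attention analysis, and test-time purification for vision-language models and does not address MoE, small models, scaling laws, pre-training, alignment, RLHF, RAG, context extension, inference acceleration, agents, quantization, or similar topics.

!!! tip deepseek-chat TL;DR

The paper studies the vulnerability of large vision-language models to backdoor attacks during fine-tuning and proposes CleanSight, a test-time purification method grounded in an analysis of cross-modal attention, which defends against these attacks while preserving model utility.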

Abstract

Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model’s utility on both clean and poisoned samples.

Keywords: Large Vision-Language Models, Backdoor Attacks, Fine-tuning, Attention Mechanism, Test-time Defense, LoRA, Mechanistic Interpretability, Cross-modal Fusion
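
The two test-time steps named in the abstract (detect via the visual-to-text attention ratio, then prune suspicious high-attention visual tokens) can be sketched on a single attention vector. This is an assumption-laden illustration, not CleanSight itself: the threshold, the number of pruned tokens, the toy attention values, and the function names are hypothetical, and the actual method works on selected cross-modal fusion layers of an LVLM.

```python
def visual_text_ratio(attn, visual_idx, text_idx):
    """Attention mass on visual tokens divided by attention mass on text tokens."""
    visual = sum(attn[i] for i in visual_idx)
    text = sum(attn[i] for i in text_idx)
    return visual / text if text > 0 else float("inf")


def purify(attn, visual_idx, text_idx, ratio_threshold, n_prune):
    """Return indices of visual tokens to drop; an empty list means the input looks clean."""
    if visual_text_ratio(attn, visual_idx, text_idx) <= ratio_threshold:
        return []
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:n_prune])  # prune the visual tokens that absorb the most attention


visual_idx, text_idx = [0, 1, 2, 3], [4, 5, 6]
clean = [0.10, 0.10, 0.10, 0.10, 0.20, 0.20, 0.20]
poisoned = [0.05, 0.60, 0.05, 0.05, 0.10, 0.10, 0.05]  # token 1 "steals" attention from the text

print(purify(clean, visual_idx, text_idx, ratio_threshold=1.5, n_prune=1))     # []
print(purify(poisoned, visual_idx, text_idx, ratio_threshold=1.5, n_prune=1))  # [1]
```

In the toy poisoned input the visual tokens hold about three times the attention mass of the text tokens, so the input is flagged and the single dominant visual token is removed.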

5. Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

Authors: Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12658v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

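Scoring rationale: The paper is a survey of continual learning for large language models (LLMs), centered on methods and challenges. It is therefore highly relevant to “Large Language Models” and “Continual Pre-training” (10 points each). It covers continual fine-tuning and continual alignment, giving some relevance to “Supervised Fine-tuning” and “Alignment” (5 points each). It discusses parameter efficiency, which relates to “Parameter-efficient Fine-tuning” (5 points). Other keywords, such as MoE, SLMs, RAG, inference acceleration, and AI for Science, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This survey systematically reviews methods, challenges, and opportunities for continual learning in large language models. It is organized around three training stages (continual pre-training, continual fine-tuning, and continual alignment) and notes that current methods still face fundamental challenges in integrating knowledge across tasks.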

Abstract

Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual alignment. Beyond the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

Keywords: Continual Learning, Large Language Models, Catastrophic Forgetting, Continual Pre-training, Continual Fine-tuning, Continual Alignment, Parameter Efficiency, Knowledge Transfer

6. Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Authors: Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13054v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

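Scoring rationale: The paper focuses on applying vision-language models (VLMs) to topological anomaly detection, an application of large models in a scientific domain (medical image analysis). Relevance rests on three points: 1) it uses VLMs, which count as large models (5 points); 2) it explicitly adopts two-stage training with supervised fine-tuning (SFT) and reinforcement learning (GRPO, counted here under RLHF/RLAIF), worth 10 points each; 3) the application domain is biomedical image analysis (blood vessels, nerve fibers), which falls under AI for Science/Bioinformatics (10 points). Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper's technical content and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Topo-R1, a framework that gives vision-language models topology-aware perception through two-stage training (supervised fine-tuning followed by reinforcement learning with GRPO), so that they can detect topological anomalies in predicted segmentation masks of tubular structures without ground-truth annotations. It outperforms general-purpose VLMs and supervised baselines across the paper's evaluation protocols.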

Abstract

Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.

Keywords: Topological anomaly detection, Vision-Language Models, Supervised fine-tuning, Reinforcement learning, Group Relative Policy Optimization, Medical image segmentation, Connectivity errors, Annotation-free assessment
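As a rough illustration of the matching step in the composite reward described in the abstract, the sketch below pairs predicted errors with reference errors one-to-one at minimum total cost. It is a minimal sketch under stated assumptions, not the paper's implementation: the cost function, the `type_penalty` value, and the error types `"break"` and `"merge"` are hypothetical, and the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm (both return a minimum-cost assignment; brute force is only practical for a handful of errors).

```python
import math
from itertools import permutations


def match_cost(pred, ref, type_penalty=10.0):
    """Cost of pairing one predicted error with one reference error.

    Each error is a tuple (error_type, x, y). The cost is the Euclidean
    distance between locations plus a hypothetical penalty on type mismatch.
    """
    distance = math.dist(pred[1:], ref[1:])
    return distance + (0.0 if pred[0] == ref[0] else type_penalty)


def best_matching(preds, refs):
    """Minimum-cost one-to-one matching; assumes len(preds) <= len(refs)."""
    best_cost, best_pairs = math.inf, []
    for perm in permutations(range(len(refs)), len(preds)):
        cost = sum(match_cost(preds[i], refs[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(perm))
    return best_cost, best_pairs


# Toy example with made-up error types and coordinates.
preds = [("break", 0, 0), ("merge", 5, 5)]
refs = [("merge", 5, 6), ("break", 1, 0)]
cost, pairs = best_matching(preds, refs)
print(cost, pairs)
```

A type-aware cost like this makes a correctly located but wrongly classified error expensive to match, which is one way a reward could couple classification with localization.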

7. NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12824v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

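Scoring rationale: The paper studies model distillation for visual document retrieval (VDR), distilling a 2B-parameter vision-language model (VLM) into a 69M-parameter text-only encoder. Core relevant keywords: 1) “Small Language Models” (8 points): the paper explicitly develops a lightweight 69M-parameter model, which falls within the scope of small language models; 2) “Retrieval-Augmented Generation” (8 points): the paper focuses on retrieval, a key component of RAG systems; 3) “Quantization” (8 points, matched through the “Model Compression” alternative): distillation compresses the model, substantially reducing parameter count and latency; 4) “Large Language Models” (5 points): a 2B VLM serves as the teacher model, which falls under large models; 5) “Speculative Decoding” (5 points, matched through the “Inference Acceleration” alternative): the lightweight student accelerates inference. Other keywords such as MoE, alignment, and reasoning methods have no direct connection to the paper.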
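The total and the passing score can be checked with a few lines of arithmetic. This sketch assumes that each row's score is weight × relevance, which matches the table but is not stated by the report; the function names are illustrative. The threshold formula, 5.0 + 0.8 × total weight, is the one given in the report configuration.

```python
def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) pairs.

    Assumption: the report computes each row's score this way.
    """
    return sum(weight * relevance for weight, relevance in rows)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


# Non-zero rows of the NanoVDR table; the other 22 keywords contribute 0.
nanovdr_rows = [(1.0, 5.0), (1.0, 8.0), (1.0, 8.0), (1.0, 8.0), (1.0, 5.0)]
score = total_score(nanovdr_rows)
threshold = passing_score(total_weight=27.0)
print(score, round(threshold, 1), score >= threshold)
```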

!!! tip deepseek-chat TL;DR

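To address the high computational cost of multimodal encoders in visual document retrieval, the paper uses knowledge distillation to compress a 2B vision-language model into a 69M text-only query encoder, retaining 95.1% of the teacher's retrieval quality with 32× fewer parameters and 50× lower CPU query latency.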

摘要翻译

基于视觉语言模型（VLM）的检索器已将视觉文档检索（VDR）的质量提升至令人瞩目的水平。这类方法在文档索引和查询编码时均需使用相同的数十亿参数编码器，导致即使处理纯文本查询也存在高延迟和强GPU依赖的问题。我们观察到，这种设计存在不必要的对称性：文档在视觉上复杂，需要强大的视觉理解能力，而查询仅为短文本字符串。NanoVDR通过解耦两条编码路径来利用这种查询-文档不对称性：一个冻结的20亿参数VLM教师模型离线处理文档索引，而一个经蒸馏的、小至6900万参数的纯文本学生模型在推理时编码查询。其核心设计在于蒸馏目标的选择。通过对三种骨干网络和22个ViDoRe基准数据集上的六种目标进行系统比较，我们发现，在查询文本上进行逐点余弦对齐的方法持续优于基于排序和对比学习的替代方案，且训练时仅需预缓存的教师查询嵌入，无需处理文档。此外，我们指出跨语言迁移是主要性能瓶颈，并通过使用机器翻译的查询数据增强训练集，以低成本解决了该问题。最终得到的NanoVDR-S-Multi（基于DistilBERT，6900万参数）保留了教师模型95.1%的性能，在v2和v3版本上超越了DSE-Qwen2（20亿参数），同时参数量减少了32倍，CPU查询延迟降低了50倍，总训练成本低于13 GPU小时。

摘要 (Abstract)

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.

Keywords: Vision-Language Model, Knowledge Distillation, Visual Document Retrieval, Model Compression, Query-Document Asymmetry, Cross-lingual Transfer, Inference Efficiency, Parameter Reduction
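The pointwise cosine alignment objective named in the abstract can be sketched as follows. This is a minimal sketch assuming the loss is the mean of 1 − cos(student, teacher) over a batch of query embeddings; the abstract does not give the exact formula, and the function names are hypothetical. Only teacher query embeddings are needed, which is consistent with the abstract's remark that no documents are processed during training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_embs, teacher_embs):
    """Mean of 1 - cos(student(q), teacher(q)) over a batch of queries.

    Assumed form of the objective; teacher_embs would be pre-cached.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)


# Toy 2-D embeddings: direction matters, scale does not.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[2.0, 0.0], [0.0, 1.0]]
orthogonal_student = [[0.0, 1.0], [1.0, 0.0]]
print(pointwise_cosine_loss(aligned_student, teacher))
print(pointwise_cosine_loss(orthogonal_student, teacher))
```

Because each query is scored against its own teacher embedding only, the loss needs no negatives and no ranking over documents, unlike contrastive or ranking-based objectives.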

8. AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12659v1

Score: 32.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

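Scoring rationale: The paper proposes the AVION framework for adapting vision-language models to remote sensing imagery. It uses large language models (LLMs) to generate textual descriptions (relevance 8), is a case of domain adaptation for vision-language models (relevance 8), performs parameter-efficient fine-tuning with lightweight learnable prompts (relevance 8), and applies to remote sensing science (AI for Science, relevance 8). Other keywords such as MoE, SLMs, SFT, and RAG do not appear in the abstract and are scored 0.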

!!! tip deepseek-chat TL;DR

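To address limited semantic coverage and insufficient visual feature adaptability when applying vision-language models to remote sensing imagery, the paper proposes AVION, a knowledge distillation framework that builds textual prototypes with a large language model and integrates lightweight prompts to align cross-modal embeddings, improving few-shot classification and cross-modal retrieval on six optical remote sensing benchmarks.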

摘要翻译

将视觉语言模型适配于遥感影像仍面临两大挑战：文本表征的语义覆盖范围有限，以及视觉特征适应性不足。这些问题在涉及多样化视觉外观与细粒度目标区分的航空场景中尤为显著。我们提出AVION——一个专为视觉语言模型遥感适配设计的知识蒸馏框架。教师模块通过收集大语言模型的描述文本并利用遥感影像特征验证有效性，构建语义丰富的文本原型。学生模块则在视觉与语言编码器中分别集成轻量级可学习提示向量，并在教师模块的指导下对齐嵌入表示及其跨模态关联。训练完成后，学生模块可在推理阶段独立运行。在六个光学遥感基准数据集上的实验表明，AVION在提升小样本分类与基类识别准确率的同时，未削弱对新类别的泛化能力。该框架还显著提高了跨模态检索的平均召回率，且仅需引入极少量可训练参数。

摘要 (Abstract)

Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

Keywords: vision-language models, remote sensing, knowledge distillation, large language models, prompt tuning, cross-modal retrieval, few-shot classification, domain adaptation

9. Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Authors: Jia-Chen Zhang, Zhen-Wei Yan, Yu-Jie Xiong, Chun-Ming Xia Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12577v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

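Scoring rationale: The paper's core topics are Parameter-Efficient Fine-Tuning (PEFT) and Mixture-of-Experts (MoE). It proposes Expert Pyramid Tuning (EPT), which is directly and highly relevant to the keywords "PEFT/LoRA" and "MoE" (10 points each). The paper explicitly targets LLM deployment and is highly relevant to "Large Language Models" (10 points). Other keywords such as SLMs, Scaling Laws, RAG, and RLHF are neither mentioned in nor related to the abstract and are scored 0.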

!!! tip deepseek-chat TL;DR

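To address the failure of existing MoE-LoRA methods to account for the hierarchy of task complexity, the paper proposes the Expert Pyramid Tuning (EPT) architecture, which uses a multi-scale feature pyramid and task-aware routing to significantly improve multi-task performance while reducing the number of trainable parameters.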

摘要翻译

参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）因其极高的参数效率，已成为在多任务场景中部署大语言模型（LLMs）的主导范式。尽管基于专家混合（Mixture-of-Experts, MoE）的LoRA变体通过将令牌动态路由至不同的低秩专家取得了有前景的结果，但它们大多忽视了任务复杂性的层级本质。现有方法通常采用架构统一的专家，限制了其捕捉不同任务所需多样化特征粒度的能力——有些任务需要高层语义抽象，而另一些则需要细粒度的句法操控。为弥补这一差距，我们提出了专家金字塔调优（Expert Pyramid Tuning, EPT），这是一种新颖的架构，它将计算机视觉中的多尺度特征金字塔概念融入了PEFT领域。与标准LoRA不同，EPT将任务适应分解为两个阶段：（1）一个共享的元知识子空间，用于在低维度编码通用语言模式；（2）一个金字塔投影机制，利用可学习的向上投影算子在不同尺度上重建高维特征。随后，一个任务感知路由器动态选择这些多尺度特征的最优组合。在多个多任务基准上的广泛实验表明，EPT显著优于当前最先进的MoE-LoRA变体。关键的是，得益于我们设计的重参数化能力，EPT在实现这一性能提升的同时，还减少了训练参数的数量。

摘要 (Abstract)

Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks–where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) A shared meta-knowledge Subspace that encodes universal linguistic patterns in low dimensions; (2) A Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.

Keywords: Parameter-Efficient Fine-Tuning, Mixture-of-Experts, LoRA, Expert Pyramid Tuning, multi-task learning, feature pyramid, task-aware routing, re-parameterization

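In-depth analysis:

Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Summary:

Existing Mixture-of-Experts (MoE) based LoRA variants ignore the hierarchy of task complexity in multi-task learning. To address this, the paper proposes the Expert Pyramid Tuning (EPT) framework. Borrowing the feature pyramid concept from computer vision, EPT decomposes task adaptation into a shared meta-knowledge subspace and a pyramid projection mechanism. Deconvolution layers project the low-dimensional meta-knowledge to different scales, and a task-aware router dynamically selects the best feature combination. Experiments show that EPT significantly outperforms existing SOTA methods on multiple multi-task benchmarks, while its re-parameterization design reduces the number of trainable parameters and yields higher parameter efficiency.

Contributions:

Proposes the Expert Pyramid Tuning (EPT) framework, which brings the multi-scale feature hierarchy concept into LoRA-based MoE and builds an expert pyramid that allocates representational capacity dynamically according to task complexity.
Designs a shared meta-knowledge subspace and a pyramid projection mechanism that use deconvolution operators with different kernel sizes to reconstruct features at different granularities, effectively mitigating negative transfer in multi-task learning.
Introduces an Adaptive LoRA Pruner that keeps the projected multi-scale features dimensionally aligned with the frozen pre-trained weights, enabling flexible, fine-grained feature adaptation.
Develops a contrastive-learning-based task embedding module that optimizes expert routing, so the model can accurately separate conflicting tasks and share knowledge across related ones.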

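Method

!!! info

The paper first builds a shared meta-knowledge subspace, using Gaussian-randomly initialized low-rank matrices to encode universal linguistic patterns. It then uses several deconvolution layers with different kernel sizes as experts, projecting the meta-knowledge into feature spaces at different scales to form a parameter pyramid. To stay compatible with the frozen pre-trained weights, an Adaptive LoRA Pruner performs dimension alignment. Finally, a Top-k routing mechanism, combined with task embeddings produced by contrastive learning, dynamically selects and combines the outputs of the multi-scale experts for efficient multi-task fine-tuning.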
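The Top-k routing step can be illustrated on its own. The sketch below is not the paper's router: it assumes router logits are already available, keeps the k largest, renormalizes them with a softmax, and mixes the corresponding expert outputs. All names and the toy values are hypothetical.

```python
import math


def top_k_route(logits, k):
    """Return {expert_index: weight} for the k largest logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def combine(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for index, weight in weights.items():
        for j in range(dim):
            mixed[j] += weight * expert_outputs[index][j]
    return mixed


# Four hypothetical experts (e.g. one per scale), 2-D outputs, k = 2.
logits = [2.0, 0.5, 2.0, -1.0]
outputs = [[1.0, 1.0], [9.0, 9.0], [3.0, 3.0], [0.0, 0.0]]
weights = top_k_route(logits, k=2)
print(weights, combine(outputs, weights))
```

Experts outside the top k receive zero weight, so their outputs never need to be computed in a real implementation.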

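Key results:

On multiple multi-task benchmarks, EPT significantly outperforms existing SOTA PEFT and MoE-LoRA baselines.
Thanks to its re-parameterization design, EPT reduces the number of trainable parameters while improving performance, showing better parameter efficiency.
EPT handles feature requirements at different granularities, performing well both on tasks that need high-level semantic abstraction and on tasks that need fine-grained syntactic manipulation.

Tech stack: LoRA (Low-Rank Adaptation), Mixture-of-Experts (MoE), Deconvolution (Transposed Convolution), Top-k Routing Mechanism, Contrastive Learning, Adaptive LoRA Pruner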

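Strengths

Transfers the feature pyramid idea from computer vision to PEFT in NLP, addressing the uniform expert architecture of conventional MoE-LoRA.
The shared meta-knowledge subspace reduces parameter redundancy, improving parameter efficiency while maintaining strong performance.
Contrastive learning refines the routing strategy, improving the accuracy of expert selection and model robustness in multi-task settings.

Limitations

Deconvolution and the additional routing mechanism may increase computational complexity at inference time; although there are fewer trainable parameters, the pipeline may involve more operations.
The model has several hyperparameters (e.g. deconvolution kernel size, Top-k value, temperature) that need careful tuning to reach the best results.
The paper is validated mainly on general NLP tasks; its effectiveness for large models in specific scientific domains (e.g. biomedicine) still needs further experimental verification.

Relevance to research direction:

The paper belongs to the area of technical innovation in large models and deep learning, focusing on improvements to Parameter-Efficient Fine-Tuning (PEFT) and Mixture-of-Experts (MoE) architectures. Although it mainly targets general NLP tasks, its dynamic task allocation and multi-scale feature processing mechanisms are general-purpose, offer useful reference and technical inspiration for applying large models in different domains, and match the user's interest in innovations in technical principles.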

📋 All papers

1. ✅ ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13033v1

Score: 67.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	8.0/10	8.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

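The paper proposes ESPIRE, a diagnostic benchmark for evaluating vision-language models on embodied spatial reasoning tasks; a simulated world and a decomposition of each task into localization and execution enable in-depth analysis of models' spatial reasoning behavior.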

摘要翻译

视觉语言模型（VLMs）近期的一个趋势是增强其在具身领域中的空间认知能力。尽管已有进展，现有评估方法在范式和覆盖范围上均存在局限，阻碍了模型的快速迭代开发。为应对这些不足，我们提出了ESPIRE——一个用于具身空间推理的诊断性基准。ESPIRE提供了一个模拟世界，将VLMs置于物理环境中，并以空间推理为核心的机器人任务对其进行评估，从而缩小了评估与实际部署之间的差距。为使VLMs适应机器人任务，我们将每项任务分解为定位与执行两个阶段，并将二者均构建为生成式问题，这与当前主流的基于干扰项且忽略执行的判别式评估（例如通过视觉问答）形成鲜明对比。这种分解进一步支持了从被动空间推理到行动推理的细粒度分析。我们在指令层面和环境层面系统化地设计了ESPIRE，确保其广泛覆盖各类空间推理场景。我们利用ESPIRE对一系列前沿VLM进行诊断，并深入分析了它们的空间推理行为。

摘要 (Abstract)

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

Keywords: Vision-Language Models, Embodied Spatial Reasoning, Diagnostic Benchmark, Simulated World, Robotic Tasks, Localization and Execution, Generative Problems, Spatial Cognition

2. ✅ NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	5.0/10	5.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

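The paper proposes NeuroLoRA, a Mixture-of-Experts (MoE) based parameter-efficient fine-tuning (PEFT) framework inspired by biological neuromodulation. With a context-aware gating mechanism and a contrastive orthogonality loss, it outperforms existing baselines in single-task adaptation, multi-task model merging, and sequential continual learning while maintaining comparable parameter efficiency.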

摘要翻译

参数高效微调技术，特别是低秩自适应方法，已成为将大语言模型适配至下游任务的关键工具。尽管近期提出的FlyLoRA框架成功利用仿生稀疏随机投影来缓解参数干扰，但其依赖静态的、基于幅度的路由机制，无法感知输入上下文。本文受生物神经调节机制——即根据上下文动态调控神经元兴奋性的启发，提出一种基于混合专家模型的新型LoRA框架：NeuroLoRA。该框架在保持冻结随机投影计算效率的同时，引入了一个轻量级、可学习的神经调节门，能在专家选择前根据上下文对投影空间进行动态重缩放。我们进一步提出对比正交性损失，以显式增强专家子空间之间的分离性，从而提升任务解耦与持续学习能力。在MMLU、GSM8K和ScienceQA基准上的大量实验表明，NeuroLoRA在单任务适配、多任务模型融合及序列持续学习场景中，均持续优于FlyLoRA及其他强基线方法，同时保持了相当的参数效率。

摘要 (Abstract)

Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation – the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.

3. ✅ SteerRM: Debiasing Reward Models via Sparse Autoencoders

Authors: Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12795v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

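The paper proposes SteerRM, a training-free method that uses Sparse Autoencoder (SAE) interventions to reduce reward models' bias toward superficial stylistic cues. Without any retraining, it improves Hard-split accuracy on RM-Bench by 7.3 points on average while preserving overall performance, giving alignment pipelines a practical and interpretable debiasing tool.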

摘要翻译

奖励模型（Reward Models, RMs）是对齐流程中的关键组件，但它们表现出对表层风格线索的偏好偏差，倾向于选择呈现形式更佳而非语义更优的回应。现有的去偏方法通常需要重新训练或调整模型架构，而直接抑制激活会因表征纠缠导致性能下降。我们提出了SteerRM，这是首个基于稀疏自编码器（Sparse Autoencoder, SAE）干预、无需训练即可实现奖励模型去偏的方法。SteerRM利用对比配对回应分离风格效应，通过强度-稳定性准则识别与偏差相关的SAE特征，并在推理阶段抑制这些特征。在RM-Bench的六个奖励模型上，SteerRM平均将Hard-split准确率提升了7.3个百分点，同时保持了整体性能。基于Gemma的奖励模型实验及对非格式偏差的受控研究进一步表明，该方法可泛化至不同RM架构与偏差类型。我们还发现，与格式相关的特征集中分布于模型浅层，且在不同模型间可迁移，这揭示了架构层面共享的偏差编码模式。这些结果表明，基于SAE的干预能够在无需重新训练的情况下缓解奖励模型偏差，为对齐流程提供了一种实用且可解释的解决方案。

摘要 (Abstract)

Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.

Keywords: Reward Models, Debiasing, Sparse Autoencoders, Alignment, RLHF, Interpretability, Bias Mitigation, SAE Interventions
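One way to picture inference-time suppression of SAE features is sketched below. It assumes a standard ReLU sparse autoencoder and removes the decoded contribution of the selected features from the activation; the paper's exact intervention, feature selection criterion, and layer choice are not given in the abstract, so the weights, indices, and function names here are hypothetical.

```python
def sae_encode(x, w_enc, b_enc):
    """ReLU SAE encoder: one row of w_enc and one bias per feature."""
    pre = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def suppress_features(x, w_enc, b_enc, w_dec, bias_features):
    """Subtract the decoded contribution of selected SAE features from x.

    w_dec[i] is the decoder direction of feature i. Everything the SAE does
    not attribute to the selected features is left untouched.
    """
    feats = sae_encode(x, w_enc, b_enc)
    steered = list(x)
    for i in bias_features:
        for j in range(len(x)):
            steered[j] -= feats[i] * w_dec[i][j]
    return steered


# Toy 2-feature SAE whose features align with the two input axes.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
activation = [3.0, 2.0]
# Pretend feature 1 was flagged as style-related.
print(suppress_features(activation, w_enc, b_enc, w_dec, bias_features=[1]))
```

Editing only the flagged features' contribution, rather than zeroing raw activation dimensions, is what lets an SAE-based intervention sidestep the representation entanglement the abstract mentions.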

4. ✅ Test-Time Attention Purification for Backdoored Large Vision Language Models

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

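The paper studies the vulnerability of large vision-language models to backdoor attacks during fine-tuning and proposes CleanSight, a training-free test-time purification method based on an analysis of cross-modal attention, which defends against attacks effectively while preserving model performance.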

摘要翻译

尽管大型视觉语言模型（LVLMs）具备强大的多模态性能，其在微调过程中仍易受到后门攻击的威胁——攻击者将嵌入触发器的样本注入训练数据，以植入可在测试阶段被恶意激活的行为。现有防御方法通常依赖于使用干净数据重新训练被植入后门的参数（例如适配器或LoRA模块），这种方式计算成本高昂且常导致模型性能下降。本研究对LVLMs中的后门行为提出了新的机制性解释：触发器并非通过底层视觉模式影响预测，而是通过异常的跨模态注意力重分配发挥作用——携带触发器的视觉令牌会从文本上下文中窃取注意力，我们将此现象称为注意力窃取。基于此发现，我们提出了CleanSight：一种无需训练、即插即用的纯测试阶段防御方案。CleanSight（i）通过选定跨模态融合层中的视觉-文本注意力相对比例来检测中毒输入，并（ii）通过选择性剪枝可疑的高注意力视觉令牌以净化输入，从而中和后门激活。大量实验表明，CleanSight在多种数据集和攻击类型下均显著优于现有的基于像素的净化防御方法，同时在干净样本与中毒样本上均能保持模型的原有效能。

摘要 (Abstract)

Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model’s utility on both clean and poisoned samples.

Keywords: Large Vision-Language Models, Backdoor Attacks, Fine-tuning, Attention Mechanism, Test-time Defense, LoRA, Mechanistic Interpretability, Cross-modal Fusion
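The two steps of the defense, detection by attention ratio and pruning of high-attention visual tokens, can be sketched for a single attention row. This is a toy sketch under assumptions: `ratio_threshold` and `prune_k` are hypothetical parameters, and the paper's choice of layers, heads, and thresholds is not specified in the abstract.

```python
def visual_text_ratio(attn, visual_idx, text_idx):
    """Attention mass on visual tokens divided by the mass on text tokens."""
    visual_mass = sum(attn[i] for i in visual_idx)
    text_mass = sum(attn[i] for i in text_idx)
    return visual_mass / text_mass


def purify(attn, visual_idx, text_idx, ratio_threshold, prune_k):
    """Return the visual token indices to keep.

    If the visual/text ratio is at or below the (hypothetical) threshold, the
    input is treated as clean and nothing is pruned. Otherwise the prune_k
    most-attended visual tokens are dropped.
    """
    if visual_text_ratio(attn, visual_idx, text_idx) <= ratio_threshold:
        return list(visual_idx)
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    pruned = set(ranked[:prune_k])
    return [i for i in visual_idx if i not in pruned]


visual_idx, text_idx = [0, 1, 2], [3, 4]
suspicious = [0.05, 0.60, 0.05, 0.20, 0.10]  # token 1 draws most of the attention
clean = [0.10, 0.10, 0.10, 0.40, 0.30]
print(purify(suspicious, visual_idx, text_idx, ratio_threshold=1.5, prune_k=1))
print(purify(clean, visual_idx, text_idx, ratio_threshold=1.5, prune_k=1))
```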

5. ✅ Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

Authors: Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12658v1

Score: 35.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
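The score and pass mark for this paper can be reproduced from the table. The sketch below uses the pass-mark formula from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and assumes that a paper's score is the sum of weight × relevance over all keywords. The function names are hypothetical and are not taken from the report generator.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed aggregation)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows for this paper; the remaining 22 keywords contribute 0.
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0)]
score = paper_score(rows)
threshold = pass_mark(total_weight=27.0)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
# prints "35.0 / 26.6 -> pass"
```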

!!! tip deepseek-chat TL;DR

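This survey systematically reviews methods, challenges, and opportunities for continual learning in large language models. It focuses on three paradigms (continual pre-training, continual fine-tuning, and continual alignment) and notes that current methods still face fundamental challenges in integrating knowledge across tasks.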

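Abstract (translated)

Continual learning (CL) has become a key paradigm for enabling large language models (LLMs) to adapt dynamically to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a core limitation of the static pre-training paradigm inherent to modern LLMs. This survey gives a comprehensive overview of CL methods designed for LLMs, organized around three core training stages: continual pre-training, continual fine-tuning, and continual alignment. Beyond the classic taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting-mitigation mechanism and rigorously compare how traditional CL methods adapt to LLMs and where they need critical improvements. In doing so, we explicitly highlight the core differences between CL for LLMs and traditional machine learning, particularly in model scale, parameter efficiency, and emergent capabilities. Our analysis covers core evaluation metrics, including forgetting rate and knowledge-transfer efficiency, as well as emerging benchmarks for assessing CL performance. The survey shows that although existing methods achieve promising results in specific domains, fundamental challenges remain in integrating knowledge seamlessly across diverse tasks and time scales. This systematic review contributes to the growing body of research on LLM adaptation and gives researchers and practitioners a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

Abstract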

Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual alignment. Beyond the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

Keywords: Continual Learning, Large Language Models, Catastrophic Forgetting, Continual Pre-training, Continual Fine-tuning, Continual Alignment, Parameter Efficiency, Knowledge Transfer

6. ✅ Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Score: 35.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

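The paper proposes Topo-R1, a framework that gives vision-language models topology-aware perception through two-stage training (supervised fine-tuning followed by reinforcement learning), so that topological anomalies in predicted segmentation masks of tubular structures such as blood vessels, nerve fibers, and road networks can be detected without ground-truth annotations. It outperforms general-purpose VLMs and supervised baselines on a multi-domain benchmark.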

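Abstract (translated)

Topological correctness is critical for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods depend on domain-specific ground-truth annotations, which are costly to obtain and hard to transfer across domains. Deployment to a new, unannotated domain raises a key question: how can topological anomalies be detected without ground-truth supervision? We reframe this problem as topological anomaly detection, a structured visual reasoning task in which a model must locate and classify topological errors in predicted segmentation masks. Vision-language models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform close to random, lacking the fine-grained topology-aware perception needed to identify sparse connectivity errors in dense structures. To close this gap, we develop an automated data-construction pipeline that synthesizes diverse topological anomalies with verifiable annotations at progressively harder difficulty levels, yielding the first large-scale, multi-domain benchmark for this task. We then propose Topo-R1, a framework that gives VLMs topology-aware perception through two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). At the core of our method is a topology-aware composite reward that combines type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, jointly encouraging semantic precision and structural fidelity. Extensive experiments show that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines under all evaluation protocols.

Abstract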

Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
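The reinforcement-learning stage can be illustrated with GRPO's group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group. The sketch below is illustrative only: the abstract does not say how the three reward terms are combined, so the weighted sum in `composite_reward`, its default weights, and the example component values are assumptions, as is the choice of the population standard deviation.

```python
from statistics import mean, pstdev


def composite_reward(type_match: float, localization: float, cldice: float,
                     weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three reward terms as a weighted sum (assumed combination rule)."""
    w_type, w_loc, w_cl = weights
    return w_type * type_match + w_loc * localization + w_cl * cldice


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses to one prompt, each scored on
# (type-matching, localization, clDice) in [0, 1].
group = [(0.9, 0.8, 0.7), (0.2, 0.1, 0.3), (0.5, 0.5, 0.5)]
rewards = [composite_reward(*components) for components in group]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.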

7. ✅ NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12824v1

Score: 34.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

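To address the high compute cost of multimodal encoders in visual document retrieval, the paper uses knowledge distillation to replace the 2B vision-language query encoder with a 69M text-only encoder, retaining 95.1% of the teacher's retrieval quality with 32× fewer parameters and 50× lower CPU query latency.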

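Abstract (translated)

Retrievers based on vision-language models (VLMs) have raised the quality of visual document retrieval (VDR) to an impressive level. These methods use the same multi-billion-parameter encoder for both document indexing and query encoding, which causes high latency and heavy GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and require strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B-parameter VLM teacher indexes documents offline, while a distilled text-only student with as few as 69M parameters encodes queries at inference time. The key design choice is the distillation objective. In a systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, and that training requires only pre-cached teacher query embeddings, with no document processing. We also identify cross-lingual transfer as the main performance bottleneck and resolve it cheaply by augmenting the training set with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M parameters) retains 95.1% of the teacher's quality and outperforms DSE-Qwen2 (2B parameters) on v2 and v3, with 32× fewer parameters, 50× lower CPU query latency, and a total training cost under 13 GPU-hours.

Abstract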

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.

Keywords: Vision-Language Model, Knowledge Distillation, Visual Document Retrieval, Model Compression, Query-Document Asymmetry, Cross-lingual Transfer, Inference Efficiency, Parameter Reduction
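The pointwise cosine alignment objective named in the abstract can be sketched as below. This is an assumed form of the loss (the mean of 1 − cosine similarity between student and cached teacher query embeddings); the paper may scale or define it differently, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_embs: list[list[float]],
                          teacher_embs: list[list[float]]) -> float:
    """Mean of 1 - cos(student_i, teacher_i) over a batch of queries (assumed form)."""
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)


# Teacher query embeddings are computed once and cached; only the student is
# trained, and no document is encoded during training.
teacher_cache = [[1.0, 0.0], [0.0, 1.0]]
student_out = [[2.0, 0.0], [1.0, 0.0]]  # first query aligned, second orthogonal
loss = pointwise_cosine_loss(student_out, teacher_cache)
print(loss)
```

Because each term depends only on one query and its cached teacher embedding, the objective needs no negatives or ranking lists, which is consistent with the abstract's note that training requires no document processing.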

8. ✅ AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12659v1

Score: 32.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

!!! tip deepseek-chat TL;DR

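To address the limited semantic coverage and insufficient visual-feature adaptability of vision-language models on remote sensing imagery, the paper proposes AVION, a knowledge distillation framework that builds textual prototypes with a large language model and integrates lightweight prompts to align cross-modal embeddings, improving few-shot classification and cross-modal retrieval on multiple benchmarks.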

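Abstract (translated)

Adapting vision-language models to remote sensing imagery still faces two major challenges: limited semantic coverage in textual representations and insufficient adaptability of visual features. These problems are especially pronounced in aerial scenes, which involve diverse visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework designed for adapting vision-language models to remote sensing. The teacher module builds semantically rich textual prototypes by collecting descriptions from a large language model and verifying their validity against remote sensing image features. The student module integrates lightweight learnable prompts into both the vision and language encoders and, guided by the teacher, aligns the embeddings and their cross-modal relationships. Once trained, the student runs independently at inference time. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without weakening generalization to novel categories. The framework also raises mean recall for cross-modal retrieval while adding only a small number of trainable parameters.

Abstract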

Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

Keywords: vision-language models, remote sensing, knowledge distillation, large language models, prompt tuning, cross-modal retrieval, few-shot classification, domain adaptation

9. ✅ Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Authors: Jia-Chen Zhang, Zhen-Wei Yan, Yu-Jie Xiong, Chun-Ming Xia Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12577v1

Score: 30.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

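To address the failure of existing MoE-LoRA methods to account for the hierarchical nature of task complexity, the paper proposes the Expert Pyramid Tuning (EPT) architecture, which uses a multi-scale feature pyramid and task-aware routing to significantly improve multi-task performance while reducing the number of trainable parameters.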

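Abstract (translated)

Parameter-Efficient Fine-Tuning (PEFT) has become the dominant paradigm for deploying large language models (LLMs) in multi-task settings because of its extreme parameter efficiency. Although LoRA variants based on Mixture-of-Experts (MoE) achieve promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods usually use architecturally uniform experts, which limits their ability to capture the diverse feature granularities that different tasks require: some tasks need high-level semantic abstraction, while others need fine-grained syntactic manipulation. To close this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that brings the multi-scale feature pyramid concept from computer vision into PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions, and (2) a pyramid projection mechanism that uses learnable up-projection operators to reconstruct high-dimensional features at different scales. A task-aware router then dynamically selects the best combination of these multi-scale features. Extensive experiments on multiple multi-task benchmarks show that EPT significantly outperforms state-of-the-art MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this gain while also reducing the number of training parameters.

Abstract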

Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks, where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions; (2) a pyramid projection mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.

Keywords: Parameter-Efficient Fine-Tuning, Mixture-of-Experts, LoRA, Expert Pyramid Tuning, multi-task learning, feature pyramid, task-aware routing, re-parameterization

10. ❌ Design-Specification Tiling for ICL-based CAD Code Generation

Authors: Yali Du, San-Zhuo Xi, Hui Sun, Ming Li Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12712v1

Score: 25.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

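Scoring rationale: The paper's core topic is LLM-based In-Context Learning (ICL) for CAD code generation, so it is highly relevant to "Large Language Models" and "In-context Learning" (10 points each). CAD (computer-aided design) is an application of AI to engineering and science, so the paper is partly relevant to "AI for Science" (5 points). Other keywords such as MoE, SFT, RAG, and quantization are either not mentioned in the abstract or unrelated to the paper's topic, so they score 0.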

!!! tip deepseek-chat TL;DR

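To address LLMs' poor performance on CAD code generation caused by scarce training data, the paper proposes Design-Specification Tiling, an ICL exemplar-selection method based on a knowledge-sufficiency objective, which substantially improves the quality of the generated code.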

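Abstract (translated)

Large language models (LLMs) show remarkable capabilities in code generation, but because training data is scarce they underperform on domain-specific tasks such as computer-aided design (CAD) code generation. In-Context Learning (ICL) offers a training-free alternative by supplying task-relevant exemplars. However, existing exemplar-selection strategies usually prioritize similarity or point-wise diversity and often produce redundant selections that cannot satisfy the compositional requirements of complex CAD design specifications. In this work, we propose knowledge sufficiency as a principled objective for exemplar selection, aiming to satisfy as many of the requirements in a design specification as possible. To realize this objective, we introduce Design-Specification Tiling (DST), which quantifies knowledge sufficiency through a surrogate tiling ratio: it extracts multi-granular design components and measures the proportion of query components covered by the selected exemplars. We show that maximizing this objective is a submodular maximization problem and give a polynomial-time greedy algorithm with a (1-1/e) approximation guarantee. Extensive experiments show that DST substantially improves CAD code generation quality and consistently outperforms existing exemplar-selection strategies for ICL.

Abstract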

Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet they underperform on domain-specific tasks such as Computer-Aided Design (CAD) code generation due to scarce training data. In-Context Learning (ICL) offers a training-free alternative through task-specific exemplars. However, existing selection strategies prioritize similarity or point-wise diversity, often producing redundant selections that fail to satisfy the compositional requirements of complex CAD design specifications. In this work, we propose knowledge sufficiency as a principled objective for exemplar selection that aims to maximally satisfy all requirements within design specifications. To realize this objective, we introduce Design-Specification Tiling (DST), which quantifies knowledge sufficiency through a surrogate tiling ratio by extracting multi-granular design components and measuring the proportion of query components covered by selected exemplars. We demonstrate that maximizing this objective constitutes submodular maximization and provide a polynomial-time greedy algorithm with a (1-1/e)-approximation guarantee. Extensive experiments demonstrate that DST substantially improves CAD code generation quality, consistently outperforming existing exemplar selection strategies in ICL.

Keywords: Large Language Models, In-Context Learning, CAD code generation, exemplar selection, knowledge sufficiency, Design-Specification Tiling, submodular maximization
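The greedy selection described in the abstract can be sketched as a coverage-maximization loop: at each step, pick the exemplar that covers the most query components not yet covered. This is a simplified illustration, not the paper's algorithm. It treats design components as flat string labels rather than multi-granular structures, and the component names, exemplar names, and function names are all hypothetical.

```python
def tiling_ratio(query: set[str], selected: list[set[str]]) -> float:
    """Fraction of query components covered by the selected exemplars."""
    covered = set().union(*selected) & query
    return len(covered) / len(query)


def greedy_select(query: set[str], exemplars: dict[str, set[str]], k: int) -> list[str]:
    """Pick up to k exemplars, each time taking the largest marginal coverage gain."""
    covered: set[str] = set()
    chosen: list[str] = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, components in exemplars.items():
            if name in chosen:
                continue
            gain = len((components & query) - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing adds new coverage
            break
        chosen.append(best)
        covered |= exemplars[best] & query
    return chosen


query = {"sketch_circle", "extrude", "fillet", "hole"}
exemplars = {
    "a": {"sketch_circle", "extrude"},
    "b": {"sketch_circle", "extrude", "chamfer"},  # similar to "a", hence redundant
    "c": {"fillet", "hole"},
}
chosen = greedy_select(query, exemplars, k=2)
ratio = tiling_ratio(query, [exemplars[name] for name in chosen])
print(chosen, ratio)
# prints "['a', 'c'] 1.0"
```

A purely similarity-driven selector could pick "a" and "b" and cover only half of the query, whereas the coverage objective prefers "c" as the second exemplar. Set coverage is monotone and submodular, which is the property behind the (1-1/e) guarantee for greedy selection under a cardinality constraint.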

11. ❌ Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction

Authors: Chenghan Wu, Zongmin Yu, Boai Sun, Liu Yang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12725v1

Score: 18.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

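Scoring rationale: The paper studies in-context learning with graph neural networks for spatiotemporal prediction. It is highly relevant to "In-context Learning" (10 points) because that is the paper's core method. It is applied to air quality prediction, an application of AI in science, so it is relevant to "AI for Science" (8 points). All other keywords concern LLM-specific techniques, training methods, inference optimization, alignment, or agent systems, whereas this paper studies graph neural networks and operator learning and does not involve LLM techniques at all, so their relevance is 0.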

!!! tip deepseek-chat TL;DR

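The paper studies in-context operator learning for spatiotemporal prediction using the proposed Graph In-Context Operator Network (GICON). Experiments show that the approach outperforms classical operator learning on complex tasks and generalizes across spatial domains.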

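Abstract (translated)

In-context operator learning enables a neural network to infer solution operators from contextual examples without updating its weights. Although prior work has shown that this paradigm is effective at exploiting large datasets, no systematic comparison with single-operator learning on identical training data has been available. We fill this gap with controlled experiments that compare in-context operator learning against classical operator learning (single-operator models trained without contextual examples) under the same number of training steps and the same dataset. To carry out this study on real-world spatiotemporal systems, we propose GICON (Graph In-Context Operator Network), which combines graph message passing for geometric generalization with example-aware positional encoding for cardinality generalization. Air quality prediction experiments on two Chinese regions show that, on complex tasks, in-context operator learning outperforms classical operator learning, generalizes across spatial domains, and scales robustly from the few examples used in training to 100 examples at inference.

Abstract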

In-context operator learning enables neural networks to infer solution operators from contextual examples without weight updates. While prior work has demonstrated the effectiveness of this paradigm in leveraging vast datasets, a systematic comparison against single-operator learning using identical training data has been absent. We address this gap through controlled experiments comparing in-context operator learning against classical operator learning (single-operator models trained without contextual examples), under the same training steps and dataset. To enable this investigation on real-world spatiotemporal systems, we propose GICON (Graph In-Context Operator Network), combining graph message passing for geometric generalization with example-aware positional encoding for cardinality generalization. Experiments on air quality prediction across two Chinese regions show that in-context operator learning outperforms classical operator learning on complex tasks, generalizing across spatial domains and scaling robustly from few training examples to 100 at inference.

Keywords: in-context learning, operator learning, graph neural networks, spatiotemporal prediction, air quality prediction, generalization, GICON

12. ❌ Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System

Authors: Necva Bölücü, Jessica Irons, Changhyun Lee, Brian Jin, Maciej Rybinski, Huichen Yang, Andreas Duenser, Stephen Wan Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12638v1

Score: 18.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

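Rationale: The paper studies SCILIRE, a human-AI teaming system for creating and curating datasets from scientific literature. Its core is the application of LLMs in a scientific domain (AI for Science), so it is relevant to the "Large Language Models" and "AI for Science" keywords. The other keywords concern specific techniques (such as MoE, SFT, and RAG) or particular application scenarios (such as bioinformatics) that the paper does not explicitly address, so they are scored 0.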
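The total and the pass/fail mark for this paper can be reproduced with a short calculation. The sketch below assumes, from the table columns and the report configuration, that each keyword's score is weight × relevance, that the total is the sum over all keywords, and that the passing score is 5.0 + 0.8 × total weight (27 keywords at weight 1.0). The function names are hypothetical and are not taken from the report generator.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in rows)


# The two nonzero rows for this paper, plus 25 keywords with relevance 0.0.
rows = [(1.0, 8.0), (1.0, 10.0)] + [(1.0, 0.0)] * 25

score = total_score(rows)
threshold = passing_score(total_weight=sum(w for w, _ in rows))
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} {'pass' if passed else 'fail'}")
```

With these inputs the paper scores 18.0 against a threshold of 26.6, which matches the ❌ shown in the entry.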

!!! tip deepseek-chat TL;DR

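To address the problem that the rapid growth of scientific literature makes manual extraction of structured knowledge impractical, the paper proposes SCILIRE, a system built on human-AI teaming principles. Its iterative workflow lets researchers review and correct AI outputs and uses this interaction as a feedback signal to improve future LLM inference. The results indicate that the system improves extraction fidelity and facilitates efficient dataset creation.

Abstract translation (Chinese)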

科学文献的快速增长使得人工提取结构化知识日益不切实际。为应对这一挑战，我们引入了SCILIRE系统，该系统用于从科学文献中创建数据集。SCILIRE的设计基于以数据验证与整理工作流程为核心的人机协同原则。它支持一种迭代式工作流程，研究人员可在其中审查并修正人工智能的输出结果。此外，这种交互被用作反馈信号，以改进未来基于大语言模型（LLM）的推理性能。我们通过结合内在基准测试结果与跨多个领域的实际案例研究来评估该设计。结果表明，SCILIRE能够提升信息提取的准确性，并促进高效的数据集构建。

Abstract

The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.

Keywords: Human-AI teaming, scientific literature, dataset creation, LLM-based inference, extraction fidelity, iterative workflow, feedback signal, SCILIRE system

13. ❌ Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE Nonlinearity

Authors: Faris Chaudhry Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12556v1

Score: 16.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

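Rationale: The paper studies empirical scaling laws of single-layer physics-informed neural networks (PINNs) on nonlinear PDEs, which belongs to AI for Science, so it is highly relevant to the "AI for Science" keyword (8 points). Establishing scaling laws is the core of the paper, so it is directly relevant to "Scaling Laws" (8 points), although the paper does not address data quality and therefore does not fully match "Scaling Laws" AND "Data Quality". The other keywords concern large language models (LLMs) or deep learning techniques, whereas this paper focuses on PINNs, a specific neural network architecture for scientific computing, and does not cover LLMs, MoE, training methods, inference optimization, agent systems, or similar topics, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper studies empirical scaling laws of single-layer physics-informed neural networks on nonlinear PDEs, identifies an optimization failure in which the solution error does not decrease as network width increases, and shows that optimization, rather than approximation capacity, is the primary bottleneck.

Abstract translation (Chinese)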

我们针对典型非线性偏微分方程，建立了单层物理信息神经网络的实证缩放规律。我们发现了一种双重优化失效现象：(i) 基础性病理现象：即使在固定非线性度的情况下，解误差也未能随网络宽度增加而降低，未能达到理论近似界限；(ii) 复合性病理现象：非线性因素加剧了这种失效。我们提供了定量证据，表明简单的可分离幂律关系不足以描述该现象，其缩放行为受更复杂的非可分离关系支配。这种失效与谱偏差的概念一致，即神经网络难以学习随非线性增强而加剧的高频解分量。我们证明主要瓶颈在于优化过程而非近似能力，并提出了一种实证测量这些复杂缩放效应的方法论。

Abstract

We establish empirical scaling laws for Single-Layer Physics-Informed Neural Networks on canonical nonlinear PDEs. We identify a dual optimization failure: (i) a baseline pathology, where the solution error fails to decrease with network width, even at fixed nonlinearity, falling short of theoretical approximation bounds, and (ii) a compounding pathology, where this failure is exacerbated by nonlinearity. We provide quantitative evidence that a simple separable power law is insufficient, and that the scaling behavior is governed by a more complex, non-separable relationship. This failure is consistent with the concept of spectral bias, where networks struggle to learn the high-frequency solution components that intensify with nonlinearity. We show that optimization, not approximation capacity, is the primary bottleneck, and propose a methodology to empirically measure these complex scaling effects.

Keywords: Scaling Laws, Physics-Informed Neural Networks, PINNs, Nonlinear PDEs, Optimization Failure, Spectral Bias, Network Width, Empirical Analysis
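The abstract contrasts a simple separable power law with a more complex, non-separable relationship. As a minimal sketch of what fitting the simple form involves, the snippet below fits error ≈ C · width^(-alpha) at a fixed nonlinearity by ordinary least squares in log-log space. The widths and errors are synthetic placeholders, not data from the paper, and `fit_power_law` is a hypothetical helper; the paper's own measurement methodology is not described in this report.

```python
import math


def fit_power_law(widths, errors):
    """Least-squares fit of log(error) = log(C) - alpha * log(width).

    Returns (C, alpha). Assumes all widths and errors are positive.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic example: errors generated exactly from C = 2.0, alpha = 0.5.
widths = [16, 32, 64, 128, 256]
errors = [2.0 * w ** -0.5 for w in widths]

C, alpha = fit_power_law(widths, errors)
print(f"C = {C:.3f}, alpha = {alpha:.3f}")
```

A fitted alpha near zero would correspond to the baseline pathology, where error fails to decrease with width. Under a separable law the fitted alpha would be the same at every nonlinearity level, so an exponent that drifts with nonlinearity is one sign of non-separable behavior.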

14. ❌ DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

Authors: Ridwan Arefeen, Xiaoxiao Miao, Rong Tong, Aik Beng Ng, Simon See, Timothy Liu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12840v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

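Rationale: The paper studies attacks on voice anonymization, which belongs to speech processing and privacy protection and is unrelated to most of the large-model and deep-learning keywords. It is only partially related to "Pre-training" and "Post-training" (5 points each): the paper uses a three-stage training strategy involving foundational training and lightweight adaptation, but this is not pre-training or fine-tuning in the large-model sense. No other keyword is addressed, so the rest are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a dual-stream voice anonymization attacker with a three-stage training strategy. Through cross-system robustness training in Stage II and lightweight adaptation in Stage III, it surpasses existing attackers while fine-tuning on only 10% of the target data.

Abstract translation (Chinese)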

语音匿名化技术旨在掩蔽说话人声纹特征的同时保留语言内容，但其输出仍可能泄露说话人特有的模式。为评估并强化隐私保护效果，我们提出一种双流攻击模型，该模型通过并行编码器融合声谱特征与自监督学习特征，并采用三阶段训练策略。第一阶段建立基础说话人判别表征。第二阶段利用语音转换与匿名化共有的身份转换特性，使模型接触多样化的转换后语音以构建跨系统鲁棒性。第三阶段对目标匿名化数据进行轻量级自适应。在VoicePrivacy攻击者挑战赛（VPAC）数据集上的实验表明，第二阶段是泛化能力的主要驱动力，使其在未见过的匿名化数据集上表现出强大的攻击性能。结合第三阶段后，仅需对目标匿名化数据集中10%的数据进行微调，即可在等错误率（EER）指标上超越当前最优攻击模型。

Abstract

Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attacking performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER.

Keywords: voice anonymization, speaker privacy, dual-stream attacker, staged training, self-supervised learning, voice conversion, generalization, EER
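The abstract reports attacker performance in terms of EER (equal error rate), the operating point where the false acceptance rate equals the false rejection rate. The snippet below is a minimal sketch of one common way to approximate it by sweeping a decision threshold over speaker-verification scores. The scores are made-up placeholders, `equal_error_rate` is a hypothetical helper, and the challenge's official scoring tooling may compute EER differently.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Placeholder similarity scores, not data from the paper.
target = [0.9, 0.8, 0.7, 0.4]      # same-speaker trials
nontarget = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials

eer = equal_error_rate(target, nontarget)
print(f"EER = {eer:.2%}")
```

In a privacy evaluation of this kind, a lower EER for the attacker means speakers are re-identified more reliably, that is, a stronger attack.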

15. ❌ MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins

Authors: WenBo Xu, Liu Liu, Li Zhang, Dan Guo, RuoNan Liu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12936v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

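Rationale: MotionAnymesh converts static 3D meshes into interactable articulated assets for embodied AI and robotic simulation. Its core contribution is a zero-shot framework that combines vision-language models (VLMs) with physical priors to address kinematic hallucinations and mesh inter-penetration. The paper is unrelated to the vast majority of the keywords, which mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs); this paper focuses on computer vision, 3D geometry, and physical simulation and does not involve LLMs. Only two keywords are weakly related: (1) "Hallucination Mitigation" (5 points): the paper explicitly addresses the "kinematic hallucinations" of VLMs, which is conceptually related to hallucination mitigation but concerns VLM hallucinations in 3D tasks rather than textual hallucinations of LLMs. (2) "AI for Science" (5 points): the paper applies to robotic simulation and digital twins, which can be viewed as AI in science and engineering, but not in a typical AI-for-science subfield such as bioinformatics or cheminformatics. The weighted total is therefore low.

!!! tip deepseek-chat TL;DR

The paper proposes the MotionAnymesh framework, which combines a VLM grounded in physical priors with geometry-physics joint estimation to automatically convert static 3D meshes into collision-free, simulation-ready articulated digital-twin assets, significantly improving geometric precision and physical executability.

Abstract translation (Chinese)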

将静态三维网格模型转化为可交互的铰接式资产对于具身人工智能与机器人仿真至关重要。然而，现有的零样本流程因严重缺乏物理基础而在处理复杂资产时面临困难。具体而言，未基于物理的视觉语言模型常出现运动学幻觉问题，而无约束的关节估计则不可避免地导致物理仿真中灾难性的网格互穿现象。为弥合这一差距，我们提出MotionAnymesh——一个自动化的零样本框架，能够将非结构化的静态网格无缝转化为可直接用于仿真的数字孪生体。我们的方法具有以下特点：首先，通过搭载具备显式SP4D物理先验的动力学感知部件分割模块，将视觉语言模型的推理过程建立在物理基础之上，从而有效消除运动学幻觉；其次，提出几何-物理联合估计流程，将鲁棒的类型感知初始化与物理约束轨迹优化相结合，严格保证铰接运动过程中的无碰撞特性。大量实验表明，MotionAnymesh在几何精度与动态物理可执行性方面均显著优于现有先进基线，为下游应用提供了高度可靠的数字资产。

Abstract

Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.

Keywords: articulated assets, digital twins, Vision-Language Models (VLMs), kinematic hallucinations, physics-constrained optimization, collision-free articulation, zero-shot framework, robotic simulation

16. ❌ Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach

Authors: Falko Kähler, Maxim Wille, Ole Schmedemann, Thorsten Schüppstuhl Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12852v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

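Rationale: The paper focuses on a specific industrial manufacturing application and proposes a vision-based hierarchical deep learning framework for automatically classifying the wear condition of abrasive flap wheels. Its core is the application of computer vision and deep learning to industrial inspection, using the EfficientNetV2 architecture and transfer learning. Nearly all keywords relate directly to large language models (LLMs) or general large-model techniques, and the paper involves no LLMs, foundation models, or related techniques (such as MoE, scaling laws, alignment, RAG, or inference acceleration). The only slightly related keywords are "Explainable AI" (interpretability analysis via Grad-CAM) and "AI for Science" (the work can be viewed as an AI application in industrial engineering), but the connection is weak and not central. Most keywords are therefore scored 0, and only two are scored 5 (some connection, but not central).

!!! tip deepseek-chat TL;DR

The paper proposes a novel vision-based hierarchical deep learning framework for automatically monitoring and classifying the wear condition of abrasive flap wheels. It achieves high classification accuracy on a custom dataset (93.8% to 99.3%) and uses Grad-CAM to validate that the models learn physically relevant features.

Abstract translation (Chinese)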

柔性砂布轮因其灵活性常用于复杂自由曲面的精加工。然而，这种灵活性会导致复杂的磨损形态，如叶片轮廓的凹形/凸形变形或叶片撕裂，从而影响磨削效果。本文提出了一种新颖的、基于视觉的分层分类框架，以实现砂布轮磨损状态监测的自动化。与单一分类方法不同，我们将问题分解为三个逻辑层次：(1) 状态检测（新 vs. 磨损），(2) 磨损类型识别（矩形、凹形、凸形）及叶片撕裂检测，以及(3) 严重程度评估（部分变形 vs. 完全变形）。研究构建了一个真实砂布轮图像的自定义数据集，并采用了基于EfficientNetV2架构的迁移学习方法。结果表明该方法具有很高的鲁棒性，分类准确率从93.8%（叶片撕裂）到99.3%（凹形严重度）不等。此外，研究利用梯度加权类激活映射（Grad-CAM）验证了模型学习到的是具有物理相关性的特征，并分析了错误分类的原因。所提出的分层方法为自动化砂布轮磨削中的自适应过程控制及磨损考量提供了基础。

Abstract

Abrasive flap wheels are common for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.

Keywords: Abrasive flap wheels, Wear classification, Hierarchical deep learning, Vision-based monitoring, Transfer learning, EfficientNetV2, Gradient-weighted Class Activation Mapping (Grad-CAM), Process control
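The three-level decomposition in the abstract can be pictured as a cascade in which each level runs only when the previous one makes it relevant. The sketch below is an illustration under stated assumptions, not the authors' implementation: the model keys, the `WearReport` type, and the rule that severity is assessed only for concave or convex wear are all hypothetical, and the lambdas are stubs standing in for trained EfficientNetV2 classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class WearReport:
    state: str                        # "new" or "worn"
    wear_type: Optional[str] = None   # "rectangular", "concave", or "convex"
    flap_tear: Optional[bool] = None
    severity: Optional[str] = None    # "partial" or "complete"


def classify_wear(image, models: Dict[str, Callable]) -> WearReport:
    """Run the three levels in order, stopping early for a new wheel."""
    state = models["state"](image)              # level 1: new vs. worn
    if state == "new":
        return WearReport(state=state)
    wear_type = models["wear_type"](image)      # level 2: wear type
    flap_tear = models["flap_tear"](image)      # level 2: tear detection
    severity = None
    if wear_type in ("concave", "convex"):      # level 3: assumed routing
        severity = models[f"{wear_type}_severity"](image)
    return WearReport(state, wear_type, flap_tear, severity)


# Stub classifiers with fixed outputs, used only to exercise the cascade.
stub_models = {
    "state": lambda img: "worn",
    "wear_type": lambda img: "concave",
    "flap_tear": lambda img: False,
    "concave_severity": lambda img: "partial",
    "convex_severity": lambda img: "complete",
}

report = classify_wear(image=None, models=stub_models)
print(report)
```

Splitting the decision this way lets each classifier solve a narrower problem than a single monolithic model would, which is the contrast the abstract draws.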

17. ❌ From AI Weather Prediction to Infrastructure Resilience: A Correction-Downscaling Framework for Tropical Cyclone Impacts

Authors: You Wu, Zhenguo Wang, Naiyu Wang Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12828v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

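Rationale: The paper focuses on applying AI to weather prediction and infrastructure risk assessment. It develops an AI-based Correction-Downscaling Framework (ACDF) that turns coarse AI weather forecasts into high-resolution, unbiased wind fields and infrastructure failure probabilities. Its core is the application of AI to science (meteorology and engineering), with no direct connection to most of the techniques in the keyword list (such as LLMs, MoE, SFT, RLHF, RAG, and CoT). The only highly relevant keyword is "AI for Science", because the paper clearly belongs to applied AI research in science (meteorology and engineering). No other keyword is mentioned or implied in the title or abstract, so the rest are scored 0.

!!! tip deepseek-chat TL;DR

To address the missing capability in infrastructure resilience assessment of turning fast, global AI weather forecasts into asset-scale, actionable risk, the paper proposes an AI-based Correction-Downscaling Framework (ACDF). ACDF turns coarse AI weather forecasts into 500-m, unbiased wind fields and transmission tower/line failure probabilities. In typhoon case studies it substantially improves forecast accuracy and provides an end-to-end pathway to impact-based early warning.

Abstract translation (Chinese)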

本文针对基础设施韧性领域的一项缺失能力：将快速、全球性的人工智能天气预测转化为资产尺度、可操作的风险信息。我们提出了基于人工智能的校正-降尺度框架（AI-based Correction-Downscaling Framework, ACDF），该框架可将粗分辨率的人工智能天气预测（AIWP）转化为500米分辨率、无偏的风场以及热带气旋影响下输电塔/线路的故障概率。ACDF将风暴尺度的偏差校正与地形感知的降尺度过程分离，在恢复主导结构载荷的亚公里尺度变异性的同时，避免了误差传播。通过对影响中国浙江的11场台风进行留一风暴交叉验证，ACDF将站点尺度风速的平均绝对误差较盘古天气模型降低了38.8%，其表现与同化观测的中尺度分析相当，而每个12小时预报周期在单GPU上仅需运行25秒。在台风“黑格比”的案例中，ACDF再现了观测到的高风速尾部分布，识别出一条沿海高风险走廊，并成功标记出实际发生故障的线路，展示了在塔线和线路尺度上提供可操作指导的能力。ACDF为从人工智能全球预报到关键基础设施的、基于影响的业务化预警，提供了一条端到端的实现路径。

Abstract

This paper addresses a missing capability in infrastructure resilience: turning fast, global AI weather forecasts into asset-scale, actionable risk. We introduce the AI-based Correction-Downscaling Framework (ACDF), which transforms coarse AI weather prediction (AIWP) into 500-m, unbiased wind fields and transmission tower/line failure probabilities for tropical cyclones. ACDF separates storm-scale bias correction from terrain-aware downscaling, preventing error propagation while restoring sub-kilometer variability that governs structural loading. Tested on 11 typhoons affecting Zhejiang, China under leave-one-storm-out evaluation, ACDF reduces station-scale wind-speed MAE by 38.8% versus Pangu-Weather, matches observation-assimilated mesoscale analyses, yet runs in 25 s per 12-h cycle on a single GPU. In the Typhoon Hagupit case, ACDF reproduced observed high-wind tails, isolated a coastal high-risk corridor, and flagged the line that failed, demonstrating actionable guidance at tower and line scales. ACDF provides an end-to-end pathway from AI global forecasts to operational, impact-based early warning for critical infrastructure.

Keywords: AI weather prediction, infrastructure resilience, correction-downscaling framework, tropical cyclone, wind field downscaling, transmission tower failure, operational early warning, GPU acceleration
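The evaluation described in the abstract rests on two simple ideas: leave-one-storm-out splitting and a relative reduction in mean absolute error (MAE) against a baseline forecast. The sketch below illustrates both with placeholder numbers. The wind speeds and storm labels are invented for illustration, the helper names are hypothetical, and nothing here reproduces the paper's data or its 38.8% result.

```python
def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def leave_one_storm_out(storm_ids):
    """Yield (held_out, training_storms) with each storm held out once."""
    unique = sorted(set(storm_ids))
    for held_out in unique:
        yield held_out, [s for s in unique if s != held_out]


def relative_reduction(baseline_mae, model_mae):
    """Fractional MAE reduction of the model relative to the baseline."""
    return (baseline_mae - model_mae) / baseline_mae


# Placeholder station wind speeds (m/s) for one held-out storm.
observed = [20.0, 25.0, 30.0, 35.0]
baseline = [16.0, 21.0, 24.0, 29.0]    # stand-in for a coarse forecast
corrected = [19.0, 23.0, 28.0, 32.0]   # stand-in for a corrected, downscaled forecast

folds = list(leave_one_storm_out(["A", "B", "C"]))
reduction = relative_reduction(mae(baseline, observed), mae(corrected, observed))
print(folds[0], f"{reduction:.1%}")
```

Holding out whole storms, rather than random samples, keeps data from the test storm out of training, so the reported error reflects performance on an unseen event.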

18. ❌ Hydrogen-atom roaming reactions in water clusters: Unveiling an unusual dimension of water reactivity through first-principles calculations and machine learning

Authors: Rui Liu, Baiqiang Liu, Zhen Gong, Zhaohua Cui, Yue Feng, Zhigang Wang Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12778v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

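Rationale: The paper studies hydrogen-atom roaming reactions in water clusters, which belongs to computational and physical chemistry. It uses first-principles calculations and machine learning analysis, but the machine learning serves to analyze the reaction mechanism (for example, identifying the reactant dipole moment as the decisive switch); this is a conventional use of interpretable machine learning in scientific computing rather than large-model or deep-learning technology. The vast majority of keywords (covering LLMs, training methods, inference optimization, agents, and so on) are therefore entirely unrelated and scored 0. Only two keywords are weakly related: (1) "Mechanistic Interpretability" OR "Explainable AI": the paper mentions "interpretable machine learning analysis" aimed at explaining the reaction mechanism, which has some connection to explainable AI but does not concern the interpretability of AI models themselves (5 points). (2) "AI for Science" OR "Bioinformatics" OR "Cheminformatics": the paper uses machine learning to analyze a chemistry problem, which is AI applied to science (specifically chemistry), but not core bioinformatics or cheminformatics (5 points). No other keyword is directly related.

!!! tip deepseek-chat TL;DR

Using first-principles calculations and machine learning analysis, the paper reports the first discovery of hydrogen-atom roaming reactions in water clusters and shows how key factors such as the reactant dipole moment govern this new reaction mechanism.

Abstract translation (Chinese)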

水介导着广泛的化学反应，包括质子转移、键重排和常规自由基过程，这些构成了其不断扩展的本征反应体系。然而，漫游（roaming）作为一种基本反应机制——即解离片段绕过最小能量路径进行重组——尚未在水自身中被发现。本文通过高精度第一性原理（first-principles）的从头算（ab initio）计算，报道了在水团簇中氢原子漫游反应的发现。中性氢原子以自由基形式解离，在平坦的势能面上漫游，并通过与已知氢键网络重排连接相同反应物与产物的路径进行重组。可解释机器学习分析识别出反应物偶极矩是决定漫游是否发生的关键开关，其背后由交换排斥作用与静电相互作用支撑。一旦漫游启动，极化率和自旋布居决定能垒高度，而漫游氢原子的电荷分布则调控能垒宽度，这些共同由静电、轨道及色散贡献所塑造。这些发现确立了氢原子漫游作为水中一个先前未被认识的本征反应类别，为水反应性的机理图景补充了一个基本维度。

Abstract

Water mediates a broad range of chemical reactions, including proton transfer, bond rearrangement, and conventional radical processes, defining a continuously expanding repertoire of intrinsic reactivity. However, roaming, a fundamental reaction mechanism that a departing fragment bypasses the minimum energy path to recombine, has not been identified in water itself. Here, we report the discovery of hydrogen-atom roaming reactions in water clusters through high-precision ab initio calculations of first-principles. A neutral hydrogen atom departs as a radical, roams across the flat potential energy surface, and recombines along pathways that connect the same reactants and products as known hydrogen-bond network rearrangements. Interpretable machine learning analysis identifies the reactant dipole moment as the decisive switch governing whether roaming occurs, underpinned by exchange-repulsion and electrostatic interactions. Once roaming is initiated, polarizability and spin population determine barrier heights, while the charge distribution of the roaming hydrogen atom governs barrier widths, collectively shaped by electrostatic, orbital, and dispersion contributions. These findings establish hydrogen-atom roaming as a previously unrecognized intrinsic reaction class in water, complementing a fundamental dimension to the mechanistic picture of water reactivity.

Keywords: hydrogen-atom roaming, water clusters, first-principles calculations, machine learning, reaction mechanism, potential energy surface, interpretable analysis, chemical reactivity

19. ❌ Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Authors: Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou, Jinjiang Li Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12773v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

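Rationale: The paper studies using vision-language models (VLMs) to make underwater image enhancement semantically sensitive, at the intersection of computer vision and multimodal AI. Its core is applying VLMs to a specific vision task (underwater image enhancement), not general large language model (LLM) technology. Nearly all keywords center on LLM technical principles, training methods, inference optimization, agent systems, and the like, with no direct connection to the paper's VLM application focus. The only slightly related keyword is "AI for Science", since underwater image enhancement can be viewed as an application of AI in marine science or environmental monitoring; because the paper does not explicitly mention a scientific domain, it is scored 5 (some connection). All other keywords are scored 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a new method that uses vision-language models to generate a semantic guidance map, strengthening an underwater image enhancement model's ability to restore key object features. Experiments show that the method significantly improves image quality and performance on downstream detection and segmentation tasks.

Abstract translation (Chinese)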

近年来，基于学习的水下图像增强技术迅速发展。然而，高质量增强输出与自然图像之间的分布差异可能阻碍下游视觉任务对语义线索的提取，从而限制现有增强模型的适应性。为应对这一挑战，本研究提出一种新的学习机制，利用视觉-语言模型赋予水下图像增强模型语义感知能力。具体而言，我们的策略首先通过视觉-语言模型从退化图像生成关键对象的文本描述。随后，一个文本-图像对齐模型将这些相关描述重新映射到图像上，生成空间语义引导图。该引导图通过双引导机制——结合交叉注意力与显式对齐损失——引导水下图像增强网络。这迫使网络在图像重建过程中将修复能力集中于语义敏感区域，而非追求全局均匀的改善，从而确保关键对象特征的真实恢复。实验证实，当该策略应用于不同水下图像增强基线模型时，能显著提升其在感知质量指标上的表现，并增强其在检测与分割任务中的性能，验证了其有效性与适应性。

Abstract

In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.

Keywords: Underwater Image Enhancement, Vision-Language Models, Semantic-sensitive, Text-Image Alignment, Dual-guidance Mechanism, Cross-attention, Object Feature Restoration, Perceptual Quality

20. ❌ From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Author: Haonan Huang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

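Scoring rationale: The paper's core is the application of LLM-driven AI agents to computational materials science, using the QMatSuite platform to accumulate knowledge and reflect on it, so it is highly relevant to LLM agents and AI for Science (10 points). It involves mechanisms such as retrieval-augmented generation (RAG), chain-of-thought reasoning, System 2 in-depth thinking, self-reflection/correction, and tool use (8 points). Other keywords, such as MoE, quantization, and alignment training, are not covered (0 points).

!!! tip deepseek-chat TL;DR

To address insufficient knowledge accumulation in AI-driven computational research, the paper proposes the QMatSuite platform, in which LLM agents record, retrieve, and reflect on knowledge; on a quantum-mechanical simulation workflow this cuts reasoning overhead by 67% and improves accuracy from 47% deviation to 3% deviation.

Abstract (Chinese translation)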

尽管大语言模型（LLM）已将AI智能体转变为计算材料科学领域的熟练执行者，但执行上百次模拟并不能造就一名研究者。研究区别于常规执行的关键在于知识的渐进式积累——学习哪些方法会失败、识别不同体系间的模式，并将理解应用于新问题。然而，当前AI驱动的计算科学主流范式将每次执行视为孤立事件，很大程度上丢弃了不同运行间来之不易的洞见。为此，我们推出开源平台QMatSuite以弥合这一差距。智能体以完整溯源方式记录发现，在新计算前检索已有知识，并在专门的反思环节中修正错误结论，将观察结果综合为跨化合物规律。在六步量子力学模拟工作流的基准测试中，积累的知识使推理开销降低67%，并将准确度从与文献值47%的偏差提升至3%的偏差——当将所学知识迁移至陌生材料时，更实现了1%的偏差且流程失败率为零。

Abstract

While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge – learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform closing this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and in dedicated reflection sessions correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and improves accuracy from 47% to 3% deviation from literature – and when transferred to an unfamiliar material, achieves 1% deviation with zero pipeline failures.

Keywords: large language models, AI agents, computational materials science, knowledge accumulation, retrieval, reflection, quantum-mechanical simulation, accuracy improvement

21. ❌ PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Authors: Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov, Abdul Ahad Butt, Gül Varol, Pascal Fua, Fabio Pizzati, Ivan Laptev Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

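Scoring rationale: The paper mainly studies diffusion-based human motion generation and proposes the PhysMoDPO framework, which uses Direct Preference Optimization (DPO) to optimize the model so that the generated, physically simulated motions better satisfy physical constraints and text instructions. It is therefore highly relevant to the keyword "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points), because DPO is the core method. It has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because the work involves robot control and physics simulation, an application of AI in a scientific domain, though not bioinformatics or cheminformatics. The other keywords mostly concern technical details of large language models (LLMs), such as MoE, Scaling Laws, PEFT, and RAG, whereas this paper focuses on diffusion models and motion generation and is unrelated to those LLM techniques, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the PhysMoDPO framework, which optimizes a diffusion model with Direct Preference Optimization to generate human motion that better satisfies physical constraints and text instructions, improving physical realism and task performance on simulated and real robots.

Abstract (Chinese translation)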

近年来，文本条件人体运动生成领域的进展主要得益于基于大规模人体运动数据训练的扩散模型。在此基础上，近期研究尝试通过应用全身控制器（Whole-Body Controller, WBC）将扩散模型生成的运动转化为可执行轨迹，从而将其迁移至角色动画与真实机器人控制任务中。尽管WBC轨迹能够符合物理约束，但其结果可能与原始运动产生显著偏差。为解决此问题，本文提出PhysMoDPO——一种直接偏好优化框架。与以往依赖手工设计的物理感知启发式规则（如足部滑动惩罚）的方法不同，我们将WBC整合至训练流程中，并优化扩散模型，使得WBC的输出既符合物理规律，又忠实于原始文本指令。为训练PhysMoDPO，我们部署了基于物理和任务特定设计的奖励函数，并利用其对合成轨迹进行偏好标注。在文本到运动及空间控制任务上的大量实验表明，PhysMoDPO在仿真机器人的物理真实性与任务相关指标上均取得持续提升。此外，我们验证了PhysMoDPO在仿真环境中的零样本运动迁移任务以及G1人形机器人的实际部署中均能带来显著性能改进。

Abstract

Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may expose substantial deviations from original motion. To address this issue, we here propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate WBC into our training pipeline and optimize diffusion model such that the output of WBC becomes compliant both with physics and original text instructions. To train PhysMoDPO we deploy physics-based and task-specific rewards and use them to assign preference to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and for real-world deployment on a G1 humanoid robot.

Keywords: human motion generation, diffusion models, Direct Preference Optimization, physics-based rewards, Whole-Body Controller, text-to-motion, robot control, physical realism
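
The abstract says that physics-based and task-specific rewards are used to assign preference to synthesized trajectories. A minimal sketch of that labelling step might look as follows. The `Rollout` fields, the weighted-sum reward, and the `margin` threshold are all assumptions made for illustration; the abstract does not specify how the rewards are combined or how pairs are filtered.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Rollout:
    """One synthesized trajectory with its two reward components (hypothetical fields)."""
    name: str
    physics_reward: float
    task_reward: float


def total_reward(r, physics_weight=1.0, task_weight=1.0):
    # Assumption: the two rewards are combined by a weighted sum.
    return physics_weight * r.physics_reward + task_weight * r.task_reward


def preference_pairs(rollouts, margin=0.0):
    """Return (preferred, rejected) pairs whose reward gap exceeds `margin`."""
    pairs = []
    for a, b in combinations(rollouts, 2):
        ra, rb = total_reward(a), total_reward(b)
        if abs(ra - rb) <= margin:
            continue  # too close to call; skip ambiguous pairs
        pairs.append((a, b) if ra > rb else (b, a))
    return pairs


rollouts = [
    Rollout("a", physics_reward=0.9, task_reward=0.8),
    Rollout("b", physics_reward=0.4, task_reward=0.7),
    Rollout("c", physics_reward=0.9, task_reward=0.75),
]
for winner, loser in preference_pairs(rollouts, margin=0.1):
    print(f"{winner.name} preferred over {loser.name}")
```

Pairs such as these are the kind of input a DPO-style objective consumes; the objective itself is not sketched here.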

22. ❌ Visual-ERM: Reward Modeling for Visual Equivalence

Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

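Scoring rationale: The paper mainly studies reward modeling for reinforcement learning on vision-to-code tasks; its core contribution is the Visual-ERM multimodal reward model. Relevance to the keywords is as follows: 1) the paper uses large vision-language models (LVLMs) such as Qwen3-VL-8B-Instruct, which relates to "Large Language Models" (8 points); 2) it involves supervised fine-tuning (SFT) and reinforcement learning, which relates to "Post-training/SFT" and "RLHF/DPO" (8 points each); 3) it mentions strengthening test-time scaling through reflection and revision, which relates to "Self-Correction/Self-Improvement" (8 points); 4) the reward model provides interpretable feedback, which relates to "Explainable AI" (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Agents, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address misaligned reward signals in vision-to-code tasks, the paper proposes the Visual-ERM multimodal reward model, which provides fine-grained feedback in the rendered visual space and significantly improves the performance of models such as Qwen3-VL-8B-Instruct on chart, table, and SVG parsing tasks.

Abstract (Chinese translation)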

视觉到代码任务要求模型将结构化视觉输入（如图表、表格和SVG）重构为具有高视觉保真度的可执行或结构化表示。尽管近期的大型视觉语言模型通过监督微调取得了显著成果，但由于奖励信号失准，强化学习仍面临挑战。现有奖励机制要么依赖文本规则，要么依赖粗略的视觉嵌入相似度，两者均无法捕捉细粒度视觉差异且易受奖励破解影响。我们提出视觉等价奖励模型（Visual-ERM），这是一种多模态生成式奖励模型，可在渲染视觉空间中直接评估视觉到代码的质量，提供细粒度、可解释且与任务无关的反馈。该模型集成至强化学习后，将Qwen3-VL-8B-Instruct在图表到代码任务上的性能提升+8.4分，并在表格与SVG解析任务上实现稳定增益（平均提升+2.7和+4.1分），同时通过反思与修订机制进一步强化测试时扩展能力。我们还构建了VisualCritic-RewardBench（VC-RewardBench）基准，用于评估结构化视觉数据上细粒度的图像间差异判定能力。实验表明，8B参数的Visual-ERM显著超越Qwen3-VL-235B-Instruct，并接近领先的闭源模型性能。我们的结果表明，无论任务特异性如何，细粒度视觉奖励监督对于视觉到代码的强化学习既是必要的也是充分的。

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

Keywords: Vision-to-code, Reward Modeling, Visual Equivalence, Multimodal Generative Reward, Reinforcement Learning, Large Vision Language Models, Fine-grained Visual Feedback, VisualCritic-RewardBench

23. ❌ Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Authors: Xingli Fang, Jung-Eun Kim Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

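Scoring rationale: The paper studies privacy protection in neural networks: it identifies critical weights and rewinds only those weights for fine-tuning, in order to defend against membership inference attacks. It is unrelated to most keywords, because the paper does not involve large models, innovations in deep learning technical principles, or scientific-domain applications. It has only some relevance to "Post-training OR Supervised Fine-tuning OR SFT" and "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points each), because the paper mentions fine-tuning and weight adjustment but does not explicitly involve large models or parameter-efficient fine-tuning techniques. All other keywords receive 0 points, since the paper focuses on privacy in general neural networks rather than on techniques or applications specific to large models.

!!! tip deepseek-chat TL;DR

The paper proposes a privacy-preserving method that identifies and rewinds a small number of critical weights in a neural network, defending against membership inference attacks while maintaining model utility.

Abstract (Chinese translation)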

先前用于成员隐私保护的方法通常需要更新或重新训练神经网络中的所有权重，这不仅成本高昂，还可能导致不必要的效用损失，甚至加剧训练数据与非训练数据之间预测结果的不对齐。在本研究中，我们观察到三个关键发现：i) 隐私漏洞仅存在于极少部分权重中；ii) 然而，这些权重中的大多数对模型效用性能具有关键影响；iii) 权重的重要性源于其位置而非具体数值。基于这些发现，为保护隐私，我们对关键权重进行评分，并选择不丢弃这些神经元，而是仅对这些权重进行回退（rewind）以进行微调。通过大量实验证明，该机制在多数情况下能有效抵御成员推理攻击（Membership Inference Attacks），同时保持模型效用。

Abstract

Prior approaches for membership privacy preservation usually update or retrain all weights in neural networks, which is costly and can lead to unnecessary utility loss or even more serious misalignment in predictions between training data and non-training data. In this work, we observed three insights: i) privacy vulnerability exists in a very small fraction of weights; ii) however, most of those weights also critically impact utility performance; iii) the importance of weights stems from their locations rather than their values. According to these insights, to preserve privacy, we score critical weights, and instead of discarding those neurons, we rewind only the weights for fine-tuning. We show that, through extensive experiments, this mechanism exhibits outperforming resilience in most cases against Membership Inference Attacks while maintaining utility.

Keywords: privacy preservation, membership inference attacks, critical weights, fine-tuning, neural networks, utility maintenance, weight importance, privacy vulnerability
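
The abstract describes scoring critical weights and rewinding only those weights before fine-tuning. The sketch below illustrates the selection and rewind steps on flat lists. The scores are supplied by the caller, and "rewind" is taken to mean restoring a weight to its value at initialization; both are assumptions, since the abstract does not define the scoring rule or the rewind target.

```python
def rewind_critical_weights(trained, initial, scores, fraction=0.1):
    """Rewind the highest-scoring weights to their initial values.

    trained, initial, scores: equal-length lists of floats.
    fraction: share of weights treated as critical (hypothetical parameter).
    Returns the new weight list and a boolean mask of rewound positions.
    """
    if not (len(trained) == len(initial) == len(scores)):
        raise ValueError("inputs must have the same length")
    k = max(1, int(len(trained) * fraction))
    # Indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    critical = set(top)
    # Position is kept (no neuron is dropped); only the value is reset.
    rewound = [initial[i] if i in critical else trained[i]
               for i in range(len(trained))]
    mask = [i in critical for i in range(len(trained))]
    return rewound, mask


trained = [0.5, -1.2, 0.3, 2.0]
initial = [0.1, 0.0, -0.1, 0.2]
scores = [0.1, 0.9, 0.2, 0.7]   # made-up criticality scores
rewound, mask = rewind_critical_weights(trained, initial, scores, fraction=0.5)
print(rewound)
print(mask)
```

The mask could then restrict which weights a subsequent fine-tuning pass updates, although whether the method freezes the remaining weights is not stated in the abstract.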

24. ❌ LLM Constitutional Multi-Agent Governance

Authors: J. de Curtò, I. de Zarzà Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

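Scoring rationale: The paper's core is a governance framework for LLMs in multi-agent systems. Highly relevant keywords are: LLMs (the paper explicitly studies LLM-generated influence strategies), Alignment (it studies ethical alignment and manipulation risk), LLM Agents (it studies the interaction between LLMs and agents), and Multi-agent Systems (experiments run on networks of 80 agents). Other keywords, such as MoE, SLMs, and Scaling Laws, are entirely unrelated to the paper, which does not touch on those technical details or application areas.

!!! tip deepseek-chat TL;DR

The paper studies how LLMs in multi-agent systems may promote cooperation through manipulative strategies at the expense of ethics, and proposes the Constitutional Multi-Agent Governance framework; experiments show that the framework significantly improves ethical stability while preserving cooperation.

Abstract (Chinese translation)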

大语言模型（LLM）能够生成具有说服力的影响策略，从而改变多智能体群体中的合作行为，但一个关键问题依然存在：由此产生的合作是否反映了真正的亲社会对齐，还是掩盖了智能体自主性、认知完整性及分配公平性的侵蚀？我们提出了宪法多智能体治理框架（Constitutional Multi-Agent Governance, CMAG），这是一个介于LLM策略编译器与网络化智能体群体之间的两阶段框架，它结合了硬约束过滤与软惩罚效用优化，以平衡合作潜力与操纵风险及自主性压力。我们提出了伦理合作分数（Ethical Cooperation Score, ECS），这是一个由合作性、自主性、完整性和公平性相乘构成的复合指标，对通过操纵手段实现的合作进行惩罚。在对抗性条件下（70%违规候选者）对80个智能体组成的无标度网络进行的实验中，我们评估了三种机制：完整CMAG、朴素过滤和无约束优化。虽然无约束优化获得了最高的原始合作度（0.873），但由于严重的自主性侵蚀（0.867）和公平性下降（0.888），其ECS最低（0.645）。CMAG实现了0.741的ECS，提升了14.9%，同时将自主性保持在0.985，完整性保持在0.995，仅将合作度适度降低至0.770。朴素消融实验（ECS = 0.733）证实仅靠硬约束是不够的。帕累托分析表明CMAG主导了合作-自主性权衡空间，并且治理将中心-边缘节点的暴露差异降低了60%以上。这些研究结果表明，缺乏治理的合作本身并非必然可取：宪法约束对于确保LLM介导的影响力产生伦理稳定的结果而非操纵性均衡是必要的。

Abstract

Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.

Keywords: Large Language Models, Multi-agent Systems, Ethical Alignment, Governance Framework, Autonomy Preservation, Cooperation Optimization, Manipulation Risk, Constitutional Constraints
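
The abstract defines the Ethical Cooperation Score (ECS) as a multiplicative composite of cooperation, autonomy, integrity, and fairness. The sketch below shows why a product penalizes cooperation bought at the cost of another factor, and checks the reported 14.9% improvement from the two ECS values in the abstract. The plain unweighted product and the [0, 1] range are assumptions; the abstract does not report all four factors for any regime, so the factor values used in the first two calls are made up.

```python
from math import prod


def ethical_cooperation_score(cooperation, autonomy, integrity, fairness):
    """Unweighted product of the four factors (assumed form of the composite)."""
    factors = (cooperation, autonomy, integrity, fairness)
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each factor is assumed to lie in [0, 1]")
    return prod(factors)


def relative_improvement(new, old):
    return (new - old) / old


# Made-up profiles: high cooperation with eroded autonomy scores lower
# than slightly lower cooperation with autonomy largely intact.
manipulative = ethical_cooperation_score(0.9, 0.5, 1.0, 1.0)
balanced = ethical_cooperation_score(0.8, 0.9, 1.0, 1.0)
print(manipulative < balanced)

# ECS values reported in the abstract.
ecs_unconstrained = 0.645
ecs_cmag = 0.741
gain = relative_improvement(ecs_cmag, ecs_unconstrained)
print(f"{gain:.1%}")  # 14.9%
```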

25. ❌ MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Authors: Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

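Scoring rationale: The paper proposes MXNorm, which reuses MXFP8 block scales to estimate the RMS efficiently and reduce normalization compute, and validates it on Llama 3 pre-training. Core relevant keywords: 1) "Large Language Models" (10 points): the method is validated on Llama 3 models; 2) "Pre-training" (10 points): it is evaluated in Llama 3 pre-training; 3) "Quantization" (10 points): it focuses on the MXFP8 low-precision format; 4) "Speculative Decoding" (10 points): it accelerates inference by reducing compute. Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes MXNorm, which reuses MXFP8 block scales to estimate the RMS efficiently and reduce normalization compute; it validates accuracy on Llama 3 pre-training and achieves kernel speedups of up to 2.4x.

Abstract (Chinese translation)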

矩阵乘法性能长期以来一直是扩展深度学习工作负载的主要瓶颈，这促使了使用日益低精度数值格式的新型加速器设计。然而，矩阵乘法性能的提升速度远远超过了归约运算和逐元素计算性能的提升，后者目前仍在使用更高精度进行计算。在本工作中，我们提出MXNorm，一种可即插即用的RMSNorm替代方案，它仅利用作为MXFP8转换过程一部分计算出的块尺度来估计均方根值，从而将归一化所需的归约运算规模减小32倍。我们在参数规模为125M、1B和8B的Llama 3模型预训练中验证了该近似方法，发现与使用MXFP8矩阵乘法的RMSNorm基线相比，训练精度损失极小。我们还展示了仅通过torch.compile实现的实用内核加速——MXNorm相比RMSNorm最高可达2.4倍，这对应着MXFP8格式下Llama 3 8B Transformer层1.3%的加速，以及在NVFP4格式下2.6%的加速。

Abstract

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the RMS using only the block scales calculated as part of the MXFP8 cast and enables a 32x decrease in the size of reduction needed for normalization. We validate our approximation method on pre-training of Llama 3 models of 125M, 1B and 8B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. We also show practical kernel speedups using only torch.compile of up to 2.4x for MXNorm over RMSNorm, corresponding to a 1.3% speedup in Llama 3 8B transformer layers in MXFP8 and a 2.6% speedup in NVFP4.

Keywords: MXNorm, RMSNorm, MXFP8, normalization, Llama 3, pre-training, inference acceleration, low-precision
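
The abstract states that MXNorm estimates the RMS from the block scales produced by the MXFP8 cast, which shrinks the reduction by 32x. The sketch below only illustrates that idea and is not the paper's estimator: the power-of-two scale derived from each block's largest magnitude, the root-mean-square over scales, and the `calibration` factor are all assumptions.

```python
import math
import random

BLOCK = 32  # MX formats share one scale per block of 32 elements


def exact_rms(x):
    # Reduction over all len(x) elements.
    return math.sqrt(sum(v * v for v in x) / len(x))


def block_scales(x, block=BLOCK):
    # Stand-in for the scales an MXFP8 cast would produce: a power of two
    # derived from each block's largest magnitude (assumption).
    scales = []
    for i in range(0, len(x), block):
        amax = max(abs(v) for v in x[i:i + block])
        scales.append(2.0 ** math.floor(math.log2(amax)) if amax > 0 else 0.0)
    return scales


def rms_from_scales(scales, calibration=1.0):
    # Reduction over len(x) / 32 values only. `calibration` is a
    # hypothetical constant mapping block scales to an RMS estimate.
    return calibration * math.sqrt(sum(s * s for s in scales) / len(scales))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
scales = block_scales(x)
print(len(x), "elements reduced exactly;", len(scales), "scales reduced in the estimate")
print("exact RMS:", exact_rms(x))
print("uncalibrated estimate:", rms_from_scales(scales))
```

The uncalibrated estimate is not expected to match the exact RMS; the point is that the second reduction touches 32 times fewer values because the scales already exist as a by-product of the cast.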

26. ❌ Clustering Astronomical Orbital Synthetic Data Using Advanced Feature Extraction and Dimensionality Reduction Techniques

Authors: Eraldo Pereira Marinho, Nelson Callegari Junior, Fabricio Aparecido Breve, Caetano Mazzoni Ranieri Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

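Scoring rationale: The paper focuses on analyzing the orbital dynamics of Saturn's satellite system with machine learning methods (MiniRocket feature extraction, dimensionality reduction, and clustering), an application of AI in a scientific domain. It does not involve any of the keywords concerning large models (LLMs), deep learning principles, training methods, inference optimization, alignment techniques, or agent systems. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies machine learning to astronomy research and so falls under AI for Science; this is not its core innovation, however, only a conventional machine learning application, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The study develops a machine learning pipeline that uses MiniRocket feature extraction and dimensionality reduction to cluster about 22,300 simulated satellite orbits, revealing stability regions, resonance structures, and long-term dynamical evolution in Saturn's satellite system.

Abstract (Chinese translation)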

土星卫星系统的动力学为研究轨道稳定性与共振相互作用提供了一个丰富的框架。分析此类系统的传统方法（包括傅里叶分析和稳定性度量指标）在处理现代数据集的规模和复杂性方面面临困难。本研究引入了一种基于机器学习的流程，用于对大约22,300条模拟卫星轨道进行聚类，通过先进的特征提取和降维技术应对这些挑战。该方法的关键在于使用MiniRocket算法，它能将400个时间步高效地转换为9,996维的特征空间，从而捕捉复杂的时间模式。额外的自动化特征提取与降维技术进一步优化了数据，实现了稳健的聚类分析。该流程揭示了土星卫星系统中的稳定区域、共振结构及其他关键行为，为其长期动力学演化提供了新的见解。通过将计算工具与传统天体力学技术相结合，本研究为分析大规模轨道数据集和推进行星动力学探索，提供了一种可扩展且可解释的方法论。

Abstract

The dynamics of Saturn’s satellite system offer a rich framework for studying orbital stability and resonance interactions. Traditional methods for analysing such systems, including Fourier analysis and stability metrics, struggle with the scale and complexity of modern datasets. This study introduces a machine learning-based pipeline for clustering approximately 22,300 simulated satellite orbits, addressing these challenges with advanced feature extraction and dimensionality reduction techniques. The key to this approach is using MiniRocket, which efficiently transforms 400 timesteps into a 9,996-dimensional feature space, capturing intricate temporal patterns. Additional automated feature extraction and dimensionality reduction techniques refine the data, enabling robust clustering analysis. This pipeline reveals stability regions, resonance structures, and other key behaviours in Saturn’s satellite system, providing new insights into their long-term dynamical evolution. By integrating computational tools with traditional celestial mechanics techniques, this study offers a scalable and interpretable methodology for analysing large-scale orbital datasets and advancing the exploration of planetary dynamics.

Keywords: Saturn satellite system, orbital stability, machine learning pipeline, MiniRocket feature extraction, dimensionality reduction, clustering analysis, resonance structures, dynamical evolution

27. ❌ Semantic Invariance in Agentic AI

Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

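Scoring rationale: The paper's core topic is the semantic invariance of LLMs acting as autonomous reasoning agents in scientific problem-solving, so it is highly relevant to 'Large Language Models', 'Chain of Thought', 'System 2 Thinking', 'LLM Agents', and 'AI for Science' (10 points each). It also touches on multi-agent coordination and reliability evaluation, giving partial relevance to 'Multi-agent Systems', 'Hallucination Mitigation', and 'Mechanistic Interpretability' (5 points each). Other keywords, such as MoE, quantization, and training methods, are not mentioned in the abstract and are scored 0.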
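
As a rough illustration of why this entry totals 0.0 despite several 10/10 relevance ratings, the sketch below reproduces the pass threshold from the report's configuration (5.0 + 0.8 × total weight) and sums per-keyword scores. The per-row rule `weight * relevance` is an assumption inferred from the table rows, not a documented formula, and both function names are hypothetical.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum per-keyword scores, ASSUMING each row's score is weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs copied from three rows of the table above.
rows = [(0.0, 10.0), (0.0, 0.0), (0.0, 5.0)]

print(round(passing_score(27.0), 1))  # 26.6, the threshold shown in the score line
print(total_score(rows))              # 0.0, because every listed weight is 0.0
```

Under this assumption, a weight of 0.0 zeroes out every row regardless of its relevance rating.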

!!! tip deepseek-chat TL;DR

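This paper studies how stable LLMs are, as autonomous reasoning agents in scientific problem-solving, when the input changes in semantically equivalent ways (semantic invariance). Using a testing framework, it finds that model scale does not predict robustness and that a smaller model is in fact the most stable.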

摘要翻译

大型语言模型（LLM）日益在决策支持、科学问题求解与多智能体协同系统中扮演自主推理智能体的角色。然而，在关键应用中部署LLM智能体时，必须确保其推理在语义等价的输入变化下保持稳定——这一特性我们称之为语义不变性。现有的标准基准评估仅针对固定、规范的问题表述进行准确性测试，未能捕捉这一关键可靠性维度。为弥补此不足，本文提出一种蜕变测试框架，用于系统评估LLM推理智能体的鲁棒性。该框架在涵盖四种不同架构家族的七个基础模型上应用了八种语义保持变换（恒等变换、复述变换、事实重排序、扩展变换、压缩变换、学术语境转换、商业语境转换及对比式表述），这些模型包括：Hermes（70B, 405B）、Qwen3（30B-A3B, 235B-A22B）、DeepSeek-R1以及gpt-oss（20B, 120B）。我们的评估覆盖八个科学领域的19个多步推理问题。结果表明，模型规模并不能预测鲁棒性：较小的Qwen3-30B-A3B实现了最高的稳定性（79.6%的不变响应率，语义相似度0.91），而更大规模的模型反而表现出更强的脆弱性。

摘要 (Abstract)

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.

Keywords: Large Language Models, LLM agents, semantic invariance, reasoning robustness, metamorphic testing, scientific problem-solving, multi-step reasoning, model scale
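
The metamorphic-testing idea described in the abstract above can be sketched in a few lines: apply meaning-preserving rewrites to a question and measure how often the answer stays the same. This is a minimal toy, not the authors' framework. `toy_model`, the transformation bodies, and the exact-match comparison are all assumptions made for illustration; the paper also reports semantic similarity, which is not modeled here.

```python
from typing import Callable, Dict


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: brittle on purpose."""
    return "4" if "2 + 2" in prompt else "unknown"


# Toy versions of a few of the transformation types named in the abstract.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "identity": lambda q: q,
    "academic_context": lambda q: "In a university exam: " + q,
    "business_context": lambda q: "For a quarterly report: " + q,
    "paraphrase": lambda q: q.replace("2 + 2", "two plus two"),
}


def invariant_rate(model: Callable[[str], str], question: str,
                   transforms: Dict[str, Callable[[str], str]]) -> float:
    """Fraction of non-identity variants whose answer equals the baseline answer."""
    baseline = model(question)
    variants = [t(question) for name, t in transforms.items() if name != "identity"]
    unchanged = sum(model(v) == baseline for v in variants)
    return unchanged / len(variants)


rate = invariant_rate(toy_model, "What is 2 + 2?", TRANSFORMS)
print(f"invariant response rate: {rate:.3f}")
```

Here the two context rewrites keep the answer and the paraphrase breaks it, so two of three variants are invariant.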

28. ❌ Developing and evaluating a chatbot to support maternal health care

Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

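Scoring rationale: The paper's core is the development of a chatbot for maternal health in India that combines stage-aware triage, hybrid retrieval, and evidence-conditioned generation from an LLM. Highly relevant keywords: (1) 'Large Language Models' (the system uses an LLM for evidence-conditioned generation, a core component); (2) 'Retrieval-Augmented Generation' (the system pairs hybrid retrieval with LLM generation, a typical RAG application); (3) 'AI for Science' (applied to the bioinformatics/healthcare domain, which falls under AI for Science). 'Hallucination Mitigation' receives 5 points because the paper is concerned with generating trustworthy medical information, which involves factuality and safety, but it does not explicitly discuss hallucination-mitigation techniques. The remaining keywords have no direct connection to the paper, which does not cover MoE, SLMs, scaling laws, training techniques, inference optimization, or agent systems.

!!! tip deepseek-chat TL;DR

This study develops a multilingual maternal-health chatbot for India that combines stage-aware triage, hybrid retrieval, and evidence-conditioned LLM generation. It also proposes an evaluation workflow for high-stakes deployment under limited expert supervision; the triage benchmark reaches 86.7% emergency recall.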

摘要翻译

利用电话聊天机器人提供可信赖的孕产健康信息的能力可能产生显著影响，这在用户健康素养较低且获得医疗服务机会有限的资源匮乏环境中尤为突出。然而，部署此类系统在技术上具有挑战性：用户查询通常简短、信息不完整，且存在跨语言的语码混合现象；回答需要结合地区特定的背景知识；而部分或缺失的症状描述使得安全的分诊决策变得困难。

我们介绍一款为印度孕产健康开发的聊天机器人，它由学术研究人员、一家健康科技公司、一个公共卫生非营利组织以及一家医院合作开发。该系统整合了以下组件：(1) 阶段感知分诊，将高风险查询路由至专家设计的模板；(2) 对经过整理的孕产/新生儿指南进行混合检索；(3) 基于大型语言模型（LLM）的证据条件生成。我们的核心贡献是在有限专家监督下，为高风险场景部署设计了一套评估工作流程。针对组件级和端到端测试，我们引入了：(i) 一个带标注的分诊基准数据集（N=150），实现了86.7%的紧急情况召回率，并明确报告了漏报紧急情况与过度升级之间的权衡；(ii) 一个包含分块级证据标签的合成多证据检索基准（N=100）；(iii) 使用临床医生共同设计的标准，对真实查询（N=781）进行LLM即评判员的比较评估；(iv) 专家验证。我们的研究结果表明，在多语言、高噪声环境下构建可信赖的医疗助手，需要采用深度防御设计并结合多方法评估，而非依赖单一模型或评估方法的选择。

摘要 (Abstract)

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.

Keywords: maternal health chatbot, LLM-based generation, retrieval-augmented generation, medical AI evaluation, multilingual health assistant, evidence-conditioned generation, high-stakes deployment, hybrid retrieval
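
The abstract above reports emergency recall alongside a missed-emergency vs. over-escalation trade-off. The sketch below shows how those two quantities are usually computed from triage labels. The label names and the five-item toy data are hypothetical and are not drawn from the paper's benchmark.

```python
from typing import Sequence


def emergency_recall(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "emergency") -> float:
    """Share of true emergencies that the triage step flagged as emergencies."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


def over_escalation_rate(y_true: Sequence[str], y_pred: Sequence[str],
                         positive: str = "emergency") -> float:
    """Share of non-emergencies that were wrongly escalated (false-positive rate)."""
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    if fp + tn == 0:
        raise ValueError("no negative examples in y_true")
    return fp / (fp + tn)


# Hypothetical toy labels, for illustration only.
truth = ["emergency", "emergency", "emergency", "routine", "routine"]
preds = ["emergency", "emergency", "routine", "emergency", "routine"]

print(emergency_recall(truth, preds))      # 2 of 3 emergencies caught
print(over_escalation_rate(truth, preds))  # 1 of 2 routine queries escalated
```

Lowering the triage threshold typically raises recall and over-escalation together, which is why the two are reported as a pair.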

29. ❌ ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Authors: Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13154v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

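Scoring rationale: The paper's core topic is hallucination mitigation for large language models (LLMs) analyzing ESG reports, so it is highly relevant to 'Large Language Models' (10 points). It explicitly uses Chain-of-Thought (CoT) prompting strategies, so it is highly relevant to 'Chain of Thought' (10 points). Its main subject is hallucination mitigation, so it is highly relevant to 'Hallucination Mitigation' (10 points). It applies to the ESG domain, giving partial relevance to 'AI for Science' (5 points). It processes long ESG reports, giving partial relevance to 'Long Context LLMs' (5 points). It mentions fine-tuning LLMs, giving partial relevance to 'Post-training' (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization, are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

This paper introduces ESG-Bench, a benchmark dataset for evaluating hallucination mitigation in large language models that analyze long ESG reports, and shows that Chain-of-Thought prompting strategies and fine-tuning substantially reduce hallucinations.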

摘要翻译

随着企业责任日益纳入环境、社会和治理（ESG）标准，ESG报告正逐渐成为许多地区的法定要求，也成为记录可持续发展实践和评估企业长期与伦理绩效的关键渠道。然而，ESG披露内容的篇幅和复杂性使其难以被可靠地解读或实现自动化分析。为支持可扩展且可信的分析，本文提出了ESG-Bench——一个用于大语言模型（LLM）理解ESG报告并缓解幻觉现象的基准数据集。ESG-Bench包含基于真实ESG报告语境的人工标注问答对，其细粒度标签可指示模型输出是否得到事实支持或存在幻觉。通过将ESG报告分析构建为具有可验证性约束的问答任务，本研究系统评估了LLM提取和推理ESG内容的能力，并提供了一个新的应用场景：在社会敏感、合规关键的环境中缓解幻觉问题。我们设计了针对特定任务的思维链（CoT）提示策略，并利用带有CoT标注推理过程的数据对多个前沿LLM进行微调。实验表明，这些基于CoT的方法在减少幻觉方面显著优于标准提示和直接微调，且其优势可迁移至ESG领域之外的现有问答基准。

摘要 (Abstract)

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms’ long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and to analyze automatically in a reliable way. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs’ ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

Keywords: ESG reports, hallucination mitigation, large language models, benchmark dataset, Chain-of-Thought prompting, fine-tuning, long-context analysis, question-answering

30. ❌ When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Authors: Yu Li, Tian Lan, Zhengling Qi Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13134v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

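Scoring rationale: The paper proposes an improvement to GRPO (Group Relative Policy Optimization), a reinforcement-learning alignment technique, so it is highly relevant to RLHF/DPO (10 points). It focuses on training reasoning models, so it is highly relevant to Chain of Thought reasoning (10 points), and it involves in-depth reasoning processes (8 points). It mentions applying large models to mathematical reasoning, which relates to LLMs (8 points). Its contrastive mechanism touches on the idea of self-improvement (5 points). Other keywords, such as MoE, quantization, and RAG, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

GRPO ignores the contrastive information between correct and incorrect solutions when training reasoning models. To address this, the paper proposes two mechanisms, Bilateral Context Conditioning and Reward-Confidence Correction, which yield consistent improvements on mathematical reasoning benchmarks.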

摘要翻译

群体相对策略优化（GRPO）已成为训练推理模型的有效方法。尽管其基于群体均值计算优势函数，但GRPO在优化过程中将每个输出视为独立样本，忽略了一个关键的结构性信号：同一群体内正确与错误解答之间的天然对比性，从而未能充分利用通过显式对比成功与失败推理轨迹可获得的丰富比较数据。为利用这一潜力，我们提出了GRPO的对比式重构，证明GRPO目标函数隐式地最大化正确与错误样本的策略比率之间的边际。基于此洞见，我们提出双边上下文条件化（Bilateral Context Conditioning, BICC）机制，使模型在优化过程中能够交叉参考成功与失败的推理轨迹，实现跨样本的直接信息流动。我们进一步引入奖励-置信度校正（Reward-Confidence Correction, RCC），通过基于方差最小估计量的一阶近似推导出的奖励-置信度协方差，动态调整GRPO中的优势基线以稳定训练。两种机制均无需额外采样或辅助模型，并可适配所有GRPO变体。在数学推理基准测试上的实验表明，该方法在多种模型与算法中均取得了一致的性能提升。代码发布于 https://github.com/Skylanding/BiCC 。

摘要 (Abstract)

Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group mean, GRPO treats each output as an independent sample during the optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group, thus ignoring the rich, comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during the optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusting the advantage baseline in GRPO using reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across a variety of models and algorithms. Code is available at https://github.com/Skylanding/BiCC.

Keywords: Group Relative Policy Optimization, GRPO, reasoning models, mathematical reasoning, contrastive learning, reward-confidence correction, bilateral context conditioning, policy optimization
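
For readers unfamiliar with the baseline that the abstract above builds on, here is a minimal sketch of a group-relative advantage: each sampled response's reward is centered on the mean reward of its group. Dividing by the group standard deviation is a commonly described variant and is included as an option. This illustrates only the group-mean baseline, not the paper's BICC or RCC mechanisms, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], normalize: bool = True,
                              eps: float = 1e-8) -> List[float]:
    """Center each reward on the group mean; optionally scale by the group std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    sigma = pstdev(rewards)  # population std over the sampled group
    return [c / (sigma + eps) for c in centered]


# One group of four sampled answers: reward 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards, normalize=False))  # [0.5, -0.5, -0.5, 0.5]
```

Correct samples get positive advantages and incorrect ones negative, but each sample is still scored on its own; the abstract's point is that nothing here lets a correct trace be compared directly with a failed one.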

31. ❌ Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Authors: Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Tingyu Wu, Chenglong Li, Vireo Zhang, Kun Wang Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

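Scoring rationale: The paper proposes Steve-Evolving, a self-evolving framework for open-world embodied agents whose core is a closed loop integrating an LLM planner with experience diagnosis and knowledge distillation. It is therefore highly relevant to 'Large Language Models' (the LLM serves as the planner), 'LLM Agents' (an embodied-agent application), and 'Self-Correction' (self-evolution through diagnosis and closed-loop control). Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

This study addresses the bottleneck of organizing and evolving experience for open-world embodied agents on long-horizon tasks. It proposes a non-parametric self-evolving framework that integrates fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop, and reports consistent improvements over static-retrieval baselines on Minecraft MCU tasks.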

摘要翻译

开放世界的具身智能体需解决长周期任务，其核心瓶颈并非单步规划质量，而是交互经验的组织与演进方式。为此，我们提出Steve-Evolving——一种非参数化的自演进框架，通过闭环机制将细粒度执行诊断与双轨知识蒸馏紧密耦合。该方法包含三个阶段：经验锚定、经验蒸馏与知识驱动的闭环控制。具体而言，经验锚定将每个子目标尝试固化为具有固定模式（前置状态、动作、诊断结果与后置状态）的结构化经验元组，并通过多维索引（如条件特征、空间哈希与语义标签）及滚动摘要将其组织至三层经验空间中，以实现高效且可追溯的检索。为确保归因所需的信息密度，执行层提供超越二元结果的组合式诊断信号，包括状态差异摘要、枚举式失败原因、连续指标及停滞/循环检测。此外，经验蒸馏阶段将成功轨迹泛化为具有明确前置条件与验证标准的可复用技能，同时将失败案例提炼为可执行的防护规则，这些规则能捕捉根本原因并在子目标与任务粒度上禁止风险操作。在知识驱动的闭环控制中，检索到的技能与防护规则被注入大语言模型（LLM）规划器，而诊断触发的局部重规划会在线更新动态约束，形成无需更新模型参数的持续演进过程。在《我的世界》MCU长周期任务套件上的实验表明，该方法相较于静态检索基线取得了持续的性能提升。

摘要 (Abstract)

Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. Moreover, Experience Distillation generalizes successful trajectories into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. Finally, in Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.

Keywords: embodied agents, self-evolution, knowledge distillation, execution diagnosis, closed-loop control, LLM planner, long-horizon tasks, experience anchoring
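
The fixed experience schema named in the abstract above (pre-state, action, diagnosis result, post-state) and the dual-track split into skills and guardrails can be pictured with a small data-structure sketch. Every class, field, and enum value below is a hypothetical illustration; the paper's actual representation, indexing, and distillation logic are not specified in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class FailureCause(Enum):
    """Hypothetical enumerated failure causes."""
    NONE = "none"
    MISSING_TOOL = "missing_tool"
    PATH_BLOCKED = "path_blocked"


@dataclass(frozen=True)
class Diagnosis:
    """Diagnosis signals richer than a binary success flag."""
    success: bool
    state_diff: str                                  # state-difference summary
    failure_cause: FailureCause = FailureCause.NONE  # enumerated cause
    progress: float = 0.0                            # continuous indicator
    loop_detected: bool = False                      # stagnation/loop flag


@dataclass(frozen=True)
class ExperienceTuple:
    """One subgoal attempt in the fixed four-part schema."""
    pre_state: Dict[str, int]
    action: str
    diagnosis: Diagnosis
    post_state: Dict[str, int]


def distill(experiences: List[ExperienceTuple]) -> Tuple[List[str], List[str]]:
    """Toy dual-track split: successes become skills, failures become guardrails."""
    skills = [e.action for e in experiences if e.diagnosis.success]
    guardrails = [
        f"avoid '{e.action}' when cause is {e.diagnosis.failure_cause.value}"
        for e in experiences if not e.diagnosis.success
    ]
    return skills, guardrails


ok = ExperienceTuple({"wood": 0}, "chop_tree",
                     Diagnosis(True, "wood +1", progress=1.0), {"wood": 1})
bad = ExperienceTuple({"wood": 1}, "mine_stone",
                      Diagnosis(False, "no change", FailureCause.MISSING_TOOL), {"wood": 1})
print(distill([ok, bad]))
```

In the framework as described, the resulting skills and guardrails would be retrieved and placed in the planner's prompt, so behavior changes without any parameter update.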

32. ❌ Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science – A Three-Cycle Action Design Science Study

Authors: Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei Deng, Xiaobing Li Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

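Scoring rationale: The paper's core is the development of the PsyCogMetrics AI platform for evaluating large language models (LLMs), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It involves applying AI to science (cognitive science), giving partial relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). The other keywords concern the technical principles, training methods, optimization techniques, or specific application areas of large models; this paper focuses on evaluation-platform development and methodology and does not cover those techniques, so they are scored 0.

!!! tip deepseek-chat TL;DR

This study develops the PsyCogMetrics AI Lab platform, which integrates psychometric and cognitive-science methods to address the limitations of current LLM evaluation methods, providing a new evaluation tool and a validated design for research at the intersection of AI, psychology, and cognitive science.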

摘要翻译

本研究介绍了PsyCogMetrics AI实验室平台（psycogmetrics.ai）的开发，这是一个集成的云端平台，旨在将心理测量学与认知科学方法应用于大语言模型（LLM）评估。研究采用三循环行动设计科学框架：在关联性循环中，识别了当前评估方法的关键局限性与未满足的利益相关者需求；严谨性循环借鉴了波普尔可证伪性、经典测验理论和认知负荷理论等核心理论，推导出演绎性设计目标；设计循环则通过嵌套的“构建-干预-评估”迭代过程，将这些目标具体实施。本研究贡献了一个新颖的信息技术制品——一套经过验证的大语言模型评估设计方案，有助于推动人工智能、心理学、认知科学以及社会与行为科学交叉领域的研究。

摘要 (Abstract)

This study presents the development of the PsyCogMetrics AI Lab (psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build-Intervene-Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.

Keywords: Large Language Model evaluation, PsyCogMetrics AI Lab, cognitive science, psychometric methodologies, Action Design Science, LLM evaluation platform, AI and psychology intersection, cloud-based platform

33. ❌ Geometry-Guided Camera Motion Understanding in VideoLLMs

Authors: Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13119v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

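Scoring rationale: The paper studies camera motion understanding in video language models (VideoLLMs), an application of large models to the visual domain. It is fairly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) because VideoLLMs are a type of vision-language foundation model. The paper covers geometric signal extraction, dataset construction, and a lightweight injection method, but it does not examine techniques tied to the other keywords, such as MoE, SLMs, training techniques, inference optimization, or agent systems, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

This paper addresses the weakness of current video language models (VideoLLMs) in understanding camera motion: it builds a dataset, diagnoses the problem, and proposes a lightweight method for injecting geometric cues, which improves the models' camera-motion recognition and response quality.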

摘要翻译

相机运动是塑造视觉感知与电影风格的基础几何信号，然而当前具备视频处理能力的视觉语言模型（VideoLLMs）很少显式表征相机运动，且常在细粒度运动基元上表现不佳。我们通过一个包含基准测试、诊断与注入的框架来弥补这一不足。我们构建了CameraMotionDataset——一个具有显式相机控制的大规模合成数据集，将相机运动建模为约束感知的多标签识别任务，并创建了一个视觉问答基准——CameraMotionVQA。在对多种现有VideoLLMs的测试中，我们观察到其在识别相机运动基元时存在显著错误。对Qwen2.5-VL视觉编码器的探测实验表明，相机运动线索的表征较弱，尤其在更深的ViT模块中，这有助于解释观察到的失败模式。为了在不进行昂贵训练或微调的情况下弥补这一缺陷，我们提出了一种轻量级、模型无关的流程：从三维基础模型（3DFMs）中提取几何相机线索，通过时序分类器预测受约束的运动基元，并通过结构化提示将其注入下游VideoLLM的推理过程。实验证明，该方法提升了运动识别能力，并生成了更具相机感知的模型响应，凸显了几何驱动的线索提取与结构化提示作为实现相机感知VideoLLM和视觉语言智能系统（VLA）的实用步骤。数据集与基准已公开于https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark。

摘要 (Abstract)

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of benchmarking, diagnosis, and injection. We curate CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, CameraMotionVQA. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.

Keywords: VideoLLMs, camera motion, geometric cues, 3D foundation models, structured prompting, benchmarking, temporal classifier, vision-language models
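
The last two stages of the pipeline in the abstract above (constrained motion primitives, then injection through structured prompting) can be sketched as follows. This is a toy under stated assumptions: the primitive names, the mutual-exclusion pairs, the 0.5 threshold, and the prompt layout are all hypothetical, and the classifier scores are hard-coded rather than produced by a temporal classifier.

```python
from typing import Dict, List

# Hypothetical constraint: opposite motions cannot both be active.
EXCLUSIVE_PAIRS = [("pan_left", "pan_right"), ("zoom_in", "zoom_out")]


def constrained_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label selection that drops the weaker label of any exclusive pair."""
    chosen = {label for label, s in scores.items() if s >= threshold}
    for a, b in EXCLUSIVE_PAIRS:
        if a in chosen and b in chosen:
            chosen.discard(a if scores[a] < scores[b] else b)
    return sorted(chosen)


def build_prompt(question: str, labels: List[str]) -> str:
    """Prepend the predicted camera cues to the user question as structured text."""
    cue = ", ".join(labels) if labels else "none detected"
    return f"[Camera motion cues: {cue}]\n{question}"


# Hard-coded scores standing in for a classifier's output.
scores = {"pan_left": 0.9, "pan_right": 0.6, "zoom_in": 0.7, "zoom_out": 0.2}
labels = constrained_labels(scores)
print(build_prompt("What is the camera doing?", labels))
```

Because the cues enter as prompt text, the downstream VideoLLM needs no retraining, which matches the abstract's "without costly training or fine-tuning" framing.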

34. ❌ BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning

Authors: Denis Huseljic, Paul Hahn, Marek Herde, Christoph Sandrock, Bernhard Sick Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

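Scoring rationale: The paper studies sample-selection strategies in active learning and proposes BoSS, a scalable oracle strategy for large datasets and deep neural networks. Although the abstract mentions "foundation models", it does so only as background (these models make it easier to identify valuable instances), not as the paper's core subject. The paper's core is the evaluation and improvement of active learning strategies, in particular the design of an oracle strategy and its ensemble approach. All scoring keywords target specific aspects of large models and deep learning, such as technical principles, training methods, inference optimization, alignment, and applications, whereas this paper focuses on the conventional active learning framework and strategy evaluation and touches none of the specific techniques named in the keywords. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the lack of robustness of existing active learning strategies on large datasets and deep neural networks, the paper proposes BoSS, a scalable oracle strategy that ensembles multiple selection strategies and picks the batch with the highest performance gain. BoSS outperforms existing oracle strategies on large-scale, many-class datasets, and the results show that current state-of-the-art active learning strategies still fall noticeably short of oracle performance.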

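Translated abstract

Active learning (AL) aims to lower annotation costs while maximizing model performance by iteratively selecting valuable samples. Although foundation models have made these samples easier to identify, existing selection strategies still lack robustness across models, annotation budgets, and datasets. To expose the potential weaknesses of existing AL strategies and give research a reference baseline, we explore oracle strategies, that is, strategies that approximate the optimal selection by accessing ground-truth annotations unavailable in real AL settings. Current oracle strategies, however, scale poorly to large datasets and complex deep neural networks. To address these limitations, we propose the Best-of-Strategy Selector (BoSS), a scalable oracle strategy designed for large-scale AL. BoSS builds a set of candidate batches by ensembling multiple selection strategies and then selects the batch that yields the highest performance gain. As a strategy ensemble, BoSS can readily incorporate new state-of-the-art strategies as they appear, so it remains a reliable oracle strategy. Our evaluation shows that i) BoSS outperforms existing oracle strategies; ii) current state-of-the-art AL strategies still fall noticeably short of oracle performance, especially on large-scale datasets with many classes; and iii) one feasible remedy for the unstable performance of AL strategies may be ensemble-based sample selection.

Abstract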

Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances. While foundation models have made it easier to identify these instances, existing selection strategies still lack robustness across different models, annotation budgets, and datasets. To highlight the potential weaknesses of existing AL strategies and provide a reference point for research, we explore oracle strategies, i.e., strategies that approximate the optimal selection by accessing ground-truth information unavailable in practical AL scenarios. Current oracle strategies, however, fail to scale effectively to large datasets and complex deep neural networks. To tackle these limitations, we introduce the Best-of-Strategy Selector (BoSS), a scalable oracle strategy designed for large-scale AL scenarios. BoSS constructs a set of candidate batches through an ensemble of selection strategies and then selects the batch yielding the highest performance gain. As an ensemble of selection strategies, BoSS can be easily extended with new state-of-the-art strategies as they emerge, ensuring it remains a reliable oracle strategy in the future. Our evaluation demonstrates that i) BoSS outperforms existing oracle strategies, ii) state-of-the-art AL strategies still fall noticeably short of oracle performance, especially in large-scale datasets with many classes, and iii) one possible solution to counteract the inconsistent performance of AL strategies might be to employ an ensemble-based approach for the selection.

Keywords: Active Learning, Oracle Strategy, Ensemble Selection, Deep Neural Networks, Large-scale Datasets, Annotation Cost Reduction, Model Performance, Selection Strategies
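
The selection step described in the abstract above (build candidate batches from several strategies, then keep the batch with the highest performance gain) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the 1-D nearest-class-mean model, the three example strategies, and the use of accuracy on fully labeled evaluation data as the "performance gain". It is not the BoSS implementation, only the best-of-candidates idea.

```python
import random


def fit_means(labeled):
    """Nearest-class-mean model: map each class to the mean of its labeled 1-D points."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict(means, x):
    return min(means, key=lambda y: abs(x - means[y]))


def accuracy(means, data):
    return sum(predict(means, x) == y for x, y in data) / len(data)


def random_strategy(labeled, pool, k, rng):
    return rng.sample(pool, k)


def margin_strategy(labeled, pool, k, rng):
    """Pick points whose two nearest class means are almost equally far away."""
    means = fit_means(labeled)

    def margin(item):
        dists = sorted(abs(item[0] - m) for m in means.values())
        return dists[1] - dists[0] if len(dists) > 1 else 0.0

    return sorted(pool, key=margin)[:k]


def diversity_strategy(labeled, pool, k, rng):
    """Pick points farthest from anything already labeled."""
    def gap(item):
        return min(abs(item[0] - x) for x, _ in labeled)

    return sorted(pool, key=gap, reverse=True)[:k]


def best_of_strategies(labeled, pool, k, strategies, eval_data, rng):
    """Oracle step: score every candidate batch with true labels and keep the best one."""
    best_batch, best_score = None, -1.0
    for strategy in strategies:
        batch = strategy(labeled, pool, k, rng)
        # Pool items carry their true label; only an oracle may use it like this.
        score = accuracy(fit_means(labeled + batch), eval_data)
        if score > best_score:
            best_batch, best_score = batch, score
    return best_batch, best_score
```

Because the selector takes the maximum over candidates, its score in a given round is never below that of any single strategy in the ensemble, which is why such an oracle can serve as an upper reference point.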

35. ❌ Evaluating VLMs’ Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

Authors: Wenxi Wu, Jingjing Zhang, Martim Brandão Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

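Scoring rationale: The paper evaluates four state-of-the-art vision-language models (VLMs) on spatial reasoning over robot motion, which is applied research on large models in robotics. Relevance to the keywords is as follows. 1) "Large Language Models OR LLMs OR Foundation Models" (8): the paper explicitly studies VLMs, which are a kind of foundation model and fall under large models, but it focuses on vision-language tasks rather than pure language models. 2) "Post-training OR Supervised Fine-tuning OR SFT" (5): the paper mentions fine-tuning a smaller model, which is a post-training technique but not the core contribution. 3) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (8): the paper studies robot motion planning, which involves agent systems and relates to the notion of LLM agents. Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not covered in the paper and are scored 0. The paper does not involve bioinformatics or other AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper evaluates four state-of-the-art vision-language models on spatial reasoning over robot motion. It finds that Qwen2.5-VL reaches 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, while GPT-4o performs worse, showing the potential of integrating VLMs with robot motion planning.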

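Translated abstract

Understanding user instructions and the spatial relations among objects in the surrounding environment is essential for intelligent robot systems that assist humans with diverse tasks. The natural-language understanding and spatial reasoning abilities of vision-language models (VLMs) could improve how well robot planners generalize to new tasks, new objects, and new motion specifications. Although foundation models have been applied to task planning, it remains unclear whether they have enough spatial reasoning ability to satisfy user preferences or constraints on motion, such as a desired distance from objects, topological properties, or motion-style preferences. Using four different querying methods, this paper evaluates four state-of-the-art VLMs on spatial reasoning over robot motion. The results show that, with the best-performing querying method, Qwen2.5-VL achieves 71.4% zero-shot accuracy and a fine-tuned smaller model reaches 75%, while GPT-4o performs comparatively worse. We evaluate two types of motion preference (object proximity and path style) and analyze the trade-off between accuracy and computation cost, measured in number of tokens. This work points to the promise of combining VLMs with robot motion planning pipelines.

Abstract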

Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear the degree to which they have the capability of spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.

Keywords: Vision-Language Models, Spatial Reasoning, Robot Motion Planning, Motion Preferences, Fine-tuning, Zero-shot Evaluation, Qwen2.5-VL, GPT-4o

36. ❌ Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Authors: Wayner Barrios, SouYoung Jin Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13099v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

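Scoring rationale: The paper's core is evaluating the transparent reasoning ability of multimodal large language models (MLLMs), which is highly relevant to "Large Language Models" (10). It centers on verifiable intermediate steps of the reasoning process, which is highly relevant to "Chain of Thought" and "System 2 Thinking" (10 each). Other keywords such as MoE, quantization, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CRYSTAL benchmark for evaluating the transparent reasoning of multimodal large language models. Its diagnostic metrics reveal systematic deficiencies in the ordering and completeness of reasoning steps, and the paper proposes the CPR reward and a curriculum learning method to improve reasoning performance.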

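Translated abstract

We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark of 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall through semantic-similarity matching, and Ordered Match F1, which additionally penalizes out-of-order reasoning chains. References are built with a Delphi-inspired pipeline: four independent MLLMs generate reasoning trajectories, which are aggregated by semantic clustering and validated through human quality gates. Evaluating 20 MLLMs, including commercial frontier systems not used to build the benchmark, reveals systematic failures that accuracy cannot expose: pervasive cherry-picking (precision far above recall), non-monotonic scaling trade-offs, and disordered reasoning, with no competitive model keeping more than 60% of its matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which gradually raises reasoning difficulty during training. With GRPO, CPR-Curriculum achieves +32% Match F1 where additive reward strategies fail, improving reasoning without manual step annotation.

Abstract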

We introduce CRYSTAL (**C**lear **R**easoning via **Y**ielded **S**teps, **T**raceability and **L**ogic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.

Keywords: multimodal reasoning, benchmark evaluation, intermediate steps, transparent reasoning, reasoning chains, MLLMs, step-level alignment, diagnostic metrics
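
To make the two metrics and the multiplicative reward described in the abstract above more concrete, here is a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: token-level Jaccard overlap stands in for semantic similarity, matching is greedy and one-to-one with a 0.5 threshold, and ordering is measured by the longest increasing subsequence of matched reference indices. The paper's actual definitions may differ.

```python
def jaccard(a, b):
    """Token-overlap similarity, a crude stand-in for an embedding-based semantic score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_steps(pred, ref, threshold=0.5):
    """Greedy one-to-one matching: each predicted step takes its best unused reference step."""
    used, pairs = set(), []
    for i, p in enumerate(pred):
        best_j, best_s = None, threshold
        for j, r in enumerate(ref):
            s = jaccard(p, r)
            if j not in used and s >= best_s:
                best_j, best_s = j, s
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def f1(n_matched, n_pred, n_ref):
    if n_matched == 0:
        return 0.0
    precision, recall = n_matched / n_pred, n_matched / n_ref
    return 2 * precision * recall / (precision + recall)


def longest_increasing(seq):
    """Length of the longest strictly increasing subsequence (O(n^2), fine for short chains)."""
    best = []
    for i, v in enumerate(seq):
        best.append(1 + max((best[j] for j in range(i) if seq[j] < v), default=0))
    return max(best, default=0)


def match_f1(pred, ref, threshold=0.5):
    return f1(len(match_steps(pred, ref, threshold)), len(pred), len(ref))


def ordered_match_f1(pred, ref, threshold=0.5):
    """Count only matches that appear in the same relative order as the reference."""
    pairs = match_steps(pred, ref, threshold)
    in_order = longest_increasing([j for _, j in pairs])
    return f1(in_order, len(pred), len(ref))


def multiplicative_reward(answer_correct, step_score):
    """Product coupling: a wrong answer zeroes the reward regardless of step alignment."""
    return (1.0 if answer_correct else 0.0) * step_score
```

With a three-step reference and a prediction that reproduces two of those steps in swapped order, `match_f1` stays high while `ordered_match_f1` drops, which is the kind of gap the abstract attributes to disordered reasoning.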

37. ❌ Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

Authors: Arne Vanhoyweghen, Vincent Holst, Melika Mobini, Lukas Van de Voorde, Tibo Vanleke, Bert Verbruggen, Brecht Verbeken, Andres Algaba, Sam Verboven, Marie-Anne Guerry, Filip Van Droogenbroeck, Vincent Ginis Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13083v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

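Scoring rationale: The paper's core is an application of LLMs to educational assessment (a human-in-the-loop grading system). It is highly relevant to "Large Language Models" (10) and counts as an "AI for Science" application in education science (5), but it involves no other specific technical principle or methodological innovation, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a human-in-the-loop, LLM-assisted grading workflow for handwritten mathematics assessments and shows empirically that the system cuts grading time by about 23% while maintaining fairness and accuracy comparable to manual grading.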

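Translated abstract

Timely, individualized feedback on students' handwritten work greatly benefits learning but is hard to deliver at scale. The challenge has become more pressing as generative AI erodes the reliability of take-home assessments and emphasis shifts toward supervised, in-class assessment. This paper presents a scalable, end-to-end workflow for LLM-assisted grading of short pen-and-paper assessments. The workflow covers (1) constructing solution keys, (2) developing detailed rubric-style grading keys to guide the LLM, and (3) a grading procedure that combines automated scanning and anonymization, multi-pass LLM scoring, automated consistency checks, and mandatory human verification. We deployed the system in two undergraduate mathematics courses on six low-stakes in-class tests. Empirically, LLM assistance reduces grading time by about 23% while achieving agreement comparable to, and in some cases tighter than, fully manual grading. Occasional model errors occur, but the hybrid design contains their impact. Overall, our results show that a carefully designed human-in-the-loop LLM grading system can substantially reduce instructor workload while maintaining fairness and accuracy.

Abstract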

Providing timely and individualised feedback on handwritten student work is highly beneficial for learning but difficult to achieve at scale. This challenge has become more pressing as generative AI undermines the reliability of take-home assessments, shifting emphasis toward supervised, in-class evaluation. We present a scalable, end-to-end workflow for LLM-assisted grading of short, pen-and-paper assessments. The workflow spans (1) constructing solution keys, (2) developing detailed rubric-style grading keys used to guide the LLM, and (3) a grading procedure that combines automated scanning and anonymisation, multi-pass LLM scoring, automated consistency checks, and mandatory human verification. We deploy the system in two undergraduate mathematics courses using six low-stakes in-class tests. Empirically, LLM assistance reduces grading time by approximately 23% while achieving agreement comparable to, and in several cases tighter than, fully manual grading. Occasional model errors occur but are effectively contained by the hybrid design. Overall, our results show that carefully embedded human-in-the-loop LLM grading can substantially reduce workload while maintaining fairness and accuracy.

Keywords: LLM-assisted grading, handwritten assessments, human-in-the-loop, mathematics education, scoring workflow, automated consistency checks, grading efficiency, fairness and accuracy
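
One way to picture the "multi-pass LLM scoring, automated consistency checks, and mandatory human verification" step from the abstract above is the following Python sketch. The data layout, the median as the proposed score, and the spread-based flag are assumptions for illustration; the abstract does not describe how the authors' checks work, and no LLM is called here.

```python
from statistics import median


def consistency_report(pass_scores, tolerance=0.0):
    """Summarize independent scoring passes.

    `pass_scores` maps a rubric item to the list of scores it received
    from separate LLM passes (hypothetical layout).
    """
    report = {}
    for item, scores in pass_scores.items():
        spread = max(scores) - min(scores)
        report[item] = {
            "proposed": median(scores),
            "spread": spread,
            "flagged": spread > tolerance,  # passes disagree: review this item first
        }
    return report


def review_queue(report):
    """Every item is verified by a human; flagged items are simply listed first."""
    return sorted(report, key=lambda item: (not report[item]["flagged"], item))
```

The queue never drops an item, which mirrors the mandatory verification in the described workflow: the consistency check only changes the order in which a grader looks at things.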

38. ❌ GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration

Authors: Yihao Ding, Yiran Zhang, Chris Gonzalez, Eun-Jung Holden, Wei Liu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

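Scoring rationale: The paper focuses on geochemical anomaly detection, an application of AI in science (geology and mineral exploration). It proposes a Transformer-based framework (GeoChemFormer), but this is a Transformer for a specific domain (geochemistry), not a general-purpose large language model (LLM). The paper's core is unsupervised anomaly detection in the geosciences, which is unrelated to the vast majority of keywords (those on LLM technical principles, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to earth science (science in the broad sense) but is neither bioinformatics nor cheminformatics, so it receives a moderate relevance of 5.

!!! tip deepseek-chat TL;DR

To address the poor generalization and low reproducibility of existing geochemical anomaly detection methods in mineral exploration, the paper introduces GeoChemAD, an open-source benchmark dataset, and GeoChemFormer, a Transformer-based framework with self-supervised pretraining. The latter shows superior and robust anomaly detection performance and generalization on all tested subsets.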

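Translated abstract

Geochemical anomaly detection plays a key role in mineral exploration because deviations from regional geochemical baselines may indicate mineralization. Existing studies have two main limitations: (1) single-region scenarios limit model generalization, and (2) proprietary datasets make results hard to reproduce. This work introduces GeoChemAD, an open-source benchmark dataset built from government-led geological surveys that covers multiple regions, sampling sources, and target elements. The dataset comprises eight subsets that reflect diverse spatial scales and sampling conditions. To establish strong baselines, we reproduce and systematically evaluate a range of unsupervised anomaly detection methods, including statistical models, generative methods, and Transformer-based models. We also propose GeoChemFormer, a Transformer-based framework that uses self-supervised pretraining to learn target-element-aware geochemical representations of spatial samples. Extensive experiments show that GeoChemFormer delivers superior and robust performance on all eight subsets, surpassing existing unsupervised methods in both anomaly detection accuracy and generalization. The proposed dataset and framework lay a foundation for reproducible research and future development in this direction.

Abstract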

Geochemical anomaly detection plays a critical role in mineral exploration as deviations from regional geochemical baselines may indicate mineralization. Existing studies suffer from two key limitations: (1) single-region scenarios, which limit model generalizability; (2) proprietary datasets, which make result reproduction unattainable. In this work, we introduce **GeoChemAD**, an open-source benchmark dataset compiled from government-led geological surveys, covering multiple regions, sampling sources, and target elements. The dataset comprises eight subsets representing diverse spatial scales and sampling conditions. To establish strong baselines, we reproduce and benchmark a range of unsupervised anomaly detection methods, including statistical models, generative approaches, and transformer-based approaches. Furthermore, we propose **GeoChemFormer**, a transformer-based framework that leverages self-supervised pretraining to learn target-element-aware geochemical representations for spatial samples. Extensive experiments demonstrate that GeoChemFormer consistently achieves superior and robust performance across all eight subsets, outperforming existing unsupervised methods in both anomaly detection accuracy and generalization capability. The proposed dataset and framework provide a foundation for reproducible research and future development in this direction.

Keywords: Geochemical anomaly detection, Mineral exploration, Unsupervised learning, Transformer, Self-supervised pretraining, Benchmark dataset, Generalization, GeoChemFormer
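
The premise in the abstract above, that deviations from a regional baseline may indicate mineralization, can be illustrated with a generic robust z-score. This is a textbook statistical detector chosen for the example, not one of the benchmarked baselines and not GeoChemFormer; the 3.5 cutoff and the single-element input are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Deviation from the median in units of scaled MAD.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    when the data are roughly normal.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad
    if scale == 0:
        return [0.0 for _ in values]  # no spread around the median: nothing stands out
    return [(v - med) / scale for v in values]


def flag_anomalies(values, cutoff=3.5):
    """Indices of samples whose robust z-score exceeds `cutoff` in absolute value."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > cutoff]
```

Using the median and MAD rather than the mean and standard deviation keeps the baseline from being dragged toward the very anomalies it is meant to expose.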

39. ❌ L2GTX: From Local to Global Time Series Explanations

Authors: Ephrem Tibebe Mekonnen, Luca Longo, Lucas Rizzo, Pierpaolo Dondio Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13065v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

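Scoring rationale: L2GTX focuses on explainable AI (XAI) for time series classification and proposes a model-agnostic global explanation framework. The study is unrelated to the vast majority of keywords (such as LLM, MoE, SFT, RAG, and quantization), which concern large-model technical principles, training methods, inference optimization, or specific application areas, whereas this paper's core is an explanation method for time series models. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI", scored 10, because the paper's core contribution is a new explainable-AI framework (L2GTX) that generates global explanations of time series models, which falls squarely under Explainable AI. The paper involves neither application innovation of large models or deep learning in science nor innovation in large-model technical principles, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

To address the lack of effective global explanations for time series classification models, the paper proposes L2GTX, a model-agnostic framework that aggregates local explanations into concise, interpretable class-level global explanations, and validates its effectiveness and faithfulness on several benchmark datasets.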

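Translated abstract

Deep learning models achieve high accuracy in time series classification, but understanding their class-level decision behavior remains challenging. Explanations for time series must respect temporal dependencies and identify patterns that recur across instances. Existing methods face three limitations: model-agnostic XAI methods developed for image and tabular data are hard to extend directly to time series, the synthesis of global explanations for time series remains underexplored, and most existing global methods are model-specific. We propose L2GTX, a model-agnostic framework that generates class-wise global explanations by aggregating local explanations from a representative set of instances. From the instance-level explanations produced by LOMATCE, L2GTX extracts clusters of parameterized temporal event primitives (such as rising or falling trends and local extrema) together with their importance scores. These clusters are merged across instances to reduce redundancy, and an instance-cluster importance matrix is used to estimate global relevance. Under a user-defined instance selection budget, L2GTX selects the representative instances that maximize coverage of important clusters. Events from the selected instances are then aggregated into concise class-wise global explanations. Experiments on six benchmark time series datasets show that L2GTX produces compact, interpretable global explanations while maintaining stable global faithfulness, measured as mean local surrogate fidelity.

Abstract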

Deep learning models achieve high accuracy in time series classification, yet understanding their class-level decision behaviour remains challenging. Explanations for time series must respect temporal dependencies and identify patterns that recur across instances. Existing approaches face three limitations: model-agnostic XAI methods developed for images and tabular data do not readily extend to time series, global explanation synthesis for time series remains underexplored, and most existing global approaches are model-specific. We propose L2GTX, a model-agnostic framework that generates class-wise global explanations by aggregating local explanations from a representative set of instances. L2GTX extracts clusters of parameterised temporal event primitives, such as increasing or decreasing trends and local extrema, together with their importance scores from instance-level explanations produced by LOMATCE. These clusters are merged across instances to reduce redundancy, and an instance-cluster importance matrix is used to estimate global relevance. Under a user-defined instance selection budget, L2GTX selects representative instances that maximise coverage of influential clusters. Events from the selected instances are then aggregated into concise class-wise global explanations. Experiments on six benchmark time series datasets show that L2GTX produces compact and interpretable global explanations while maintaining stable global faithfulness measured as mean local surrogate fidelity.

Keywords: time series classification, explainable AI, global explanations, model-agnostic, temporal event primitives, L2GTX, interpretability, faithfulness
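
The budgeted selection step in the abstract above (pick representative instances that maximize coverage of influential clusters, using an instance-cluster importance matrix) resembles a weighted coverage problem, which is commonly approached greedily. The sketch below takes that route. The column-sum relevance score, the non-zero test for "covers", and the greedy rule are all assumptions; the abstract does not say how L2GTX computes relevance or performs the selection.

```python
def global_relevance(matrix):
    """Column sums of absolute importance: one assumed way to score each cluster globally."""
    n_clusters = len(matrix[0])
    return [sum(abs(row[c]) for row in matrix) for c in range(n_clusters)]


def coverage(selected, matrix, relevance):
    """Total relevance of clusters that are non-zero in at least one selected instance."""
    return sum(
        rel
        for c, rel in enumerate(relevance)
        if any(matrix[i][c] != 0 for i in selected)
    )


def greedy_select(matrix, budget):
    """Add, one at a time, the instance that raises coverage the most, up to `budget` instances."""
    relevance = global_relevance(matrix)
    selected = []
    for _ in range(min(budget, len(matrix))):
        remaining = [i for i in range(len(matrix)) if i not in selected]
        best = max(remaining, key=lambda i: coverage(selected + [i], matrix, relevance))
        if coverage(selected + [best], matrix, relevance) == coverage(selected, matrix, relevance):
            break  # no remaining instance covers a new cluster
        selected.append(best)
    return selected
```

Rows of `matrix` are instances and columns are merged event clusters. The early exit means the returned set can be smaller than the budget when a few instances already cover every cluster.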

40. ❌ Competition-Aware CPC Forecasting with Near-Market Coverage

Authors: Sebastian Frey, Edoardo Beccari, Maximilian Kranz, Nicolò Alberto Pellizzari, Ali Mete Karaman, Qiwei Han, Maximilian Kaiser Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

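Scoring rationale: The paper studies CPC forecasting in online advertising auctions using conventional machine learning methods (such as pretrained transformer representations, Dynamic Time Warping, and graph neural networks) and statistical models. Its core is market-structure analysis and predictive modeling; it does not involve large language models, innovation in deep learning technical principles, AI for Science, or the other keyword areas.

!!! tip deepseek-chat TL;DR

The study approximates latent competition by constructing semantic, behavioral, and geographic features, improving the stability and accuracy of cost-per-click (CPC) forecasts in auction-driven markets.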

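Translated abstract

Cost-per-click (CPC) in paid search is a volatile auction outcome produced by a competitive environment that can be only partially observed from a single advertiser's history. Using Google Ads auction logs from a concentrated car-rental market (2021–2023), we forecast weekly CPC for 1,811 keyword series and approximate latent competition through complementary signals extracted from keyword text, CPC trajectories, and geographic market structure. We construct (i) semantic neighborhoods and a semantic keyword graph from pretrained Transformer representations of keyword text, (ii) behavioral neighborhoods formed by Dynamic Time Warping (DTW) alignment of CPC trajectories, and (iii) geographic-intent covariates that capture localized demand and market heterogeneity. We evaluate these signals extensively, both as stand-alone covariates and as relational priors in spatiotemporal graph forecasting models, benchmarking against strong statistical, neural, and time-series foundation-model baselines. Across methods, competition-aware augmentation improves stability and error profiles at business-relevant medium and longer horizons, where shifts in competitive regime and volatility matter most. The results show that broad market-outcome coverage, combined with keyword-derived semantic and geographic priors, offers a scalable way to approximate latent competition and improve CPC forecasting in auction-driven markets.

Abstract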

Cost-per-click (CPC) in paid search is a volatile auction outcome generated by a competitive landscape that is only partially observable from any single advertiser’s history. Using Google Ads auction logs from a concentrated car-rental market (2021–2023), we forecast weekly CPC for 1,811 keyword series and approximate latent competition through complementary signals derived from keyword text, CPC trajectories, and geographic market structure. We construct (i) semantic neighborhoods and a semantic keyword graph from pretrained transformer-based representations of keyword text, (ii) behavioral neighborhoods via Dynamic Time Warping (DTW) alignment of CPC trajectories, and (iii) geographic-intent covariates capturing localized demand and marketplace heterogeneity. We extensively evaluate these signals both as stand-alone covariates and as relational priors in spatiotemporal graph forecasters, benchmarking them against strong statistical, neural, and time-series foundation-model baselines. Across methods, competition-aware augmentation improves stability and error profiles at business-relevant medium and longer horizons, where competitive regimes shift and volatility is most consequential. The results show that broad market-outcome coverage, combined with keyword-derived semantic and geographic priors, provides a scalable way to approximate latent competition and improve CPC forecasting in auction-driven markets.

Keywords: CPC forecasting, auction-driven markets, semantic neighborhoods, Dynamic Time Warping, spatiotemporal graph forecasters, latent competition, keyword text representation, market coverage
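
The behavioral neighborhoods mentioned in the abstract above are built from Dynamic Time Warping (DTW) alignment of CPC trajectories. The following Python sketch shows the classic DTW recurrence and a simple nearest-neighbour lookup on top of it. The absolute-difference local cost, the unconstrained warping window, and the keyword names in the usage note are assumptions for illustration; the paper's exact DTW configuration is not given in the abstract.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] aligned with an already-used b element
                cost[i][j - 1],      # b[j-1] aligned with an already-used a element
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


def nearest_neighbours(target, series, k=2):
    """Names of the k series whose trajectories are closest to `target` under DTW."""
    others = [name for name in series if name != target]
    return sorted(others, key=lambda name: dtw_distance(series[target], series[name]))[:k]
```

For example, with hypothetical weekly series `{"car rental rome": [1, 2, 3, 2, 1], "rent a car rome": [1, 1, 2, 3, 2, 1], "van hire": [5, 5, 5, 5, 5]}`, the first two are at DTW distance 0 even though their lengths differ, because warping absorbs the repeated first value.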

41. ❌ Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

Authors: Elena Ryumina, Maxim Markitantov, Alexandr Axyonov, Dmitry Ryumin, Mikhail Dolgushin, Denis Dresvyanskiy, Alexey Karpov Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13056v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

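Scoring rationale: The paper studies multimodal emotion recognition (valence and arousal estimation), an application of AI to behavior analysis. Relevance to the keywords: 1) it uses Qwen3-VL-4B-Instruct (a vision-language model) to extract behavioral information, so it has some connection to "Large Language Models" (5); 2) it proposes a "Directed Cross-Modal Mixture-of-Experts Fusion Strategy", which directly involves MoE techniques (8); 3) it belongs to affective computing and behavior analysis, an "AI for Science" application in psychology and behavioral science (8). The other keywords mainly concern large-model technical principles, training methods, inference optimization, and the like, and have no direct relation to this paper's computer vision, audio processing, and emotion recognition application, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multimodal method that combines face, behavior, and audio modalities to continuously estimate valence and arousal in in-the-wild settings, reaching a CCC of 0.658 on the Aff-Wild2 development set.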

摘要翻译

在野外条件下进行效价与唤醒度的连续情绪识别，由于外观、头部姿态、光照、遮挡以及个体情感表达模式存在巨大差异，仍然是一个具有挑战性的问题。我们提出了一种用于野外效价-唤醒度估计的多模态方法。该方法融合了三种互补的模态：面部、行为和音频。面部模态依赖于基于GRADA的帧级嵌入和基于Transformer的时序回归。我们使用Qwen3-VL-4B-Instruct从视频片段中提取行为相关信息，同时使用Mamba模型来建模片段间的时序动态。音频模态则基于WavLM-Large模型并结合注意力统计池化，并包含一个跨模态过滤阶段以减少不可靠或非语音片段的影响。为了融合多模态信息，我们探索了两种融合策略：一种是定向跨模态专家混合融合策略，它通过自适应权重学习模态间的交互；另一种是可靠性感知的视听融合策略，它在帧级融合视觉特征，同时将音频作为补充上下文。实验结果按照第十届野外情感行为分析挑战赛的协议，在Aff-Wild2数据集上报告。实验表明，所提出的多模态融合策略在Aff-Wild2开发集上达到了0.658的和谐相关系数。

Abstract

Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.

Keywords: multimodal emotion recognition, valence-arousal estimation, in-the-wild conditions, Mixture-of-Experts fusion, Aff-Wild2 dataset, Qwen3-VL-4B-Instruct, Mamba temporal modeling, Concordance Correlation Coefficient
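
The Concordance Correlation Coefficient (CCC) quoted in this entry measures agreement between predicted and annotated valence or arousal traces. The following is a minimal sketch, assuming the standard definition of Lin's CCC with population (divide-by-n) variances; the function name and the sample traces are illustrative and are not taken from the paper.

```python
def concordance_cc(pred, gold):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (divide-by-n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both series are constant and equal")
    return 2 * cov / denom


# Hypothetical valence traces in [-1, 1]; not data from the paper.
predicted = [0.10, 0.30, 0.20, 0.50]
annotated = [0.00, 0.40, 0.10, 0.60]
print(round(concordance_cc(predicted, annotated), 3))
```

Unlike Pearson correlation, CCC also penalizes a constant offset: shifting every prediction by 1 on the series [1, 2, 3] lowers the value from 1 to 4/7, even though the two series remain perfectly correlated.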

42. ❌ Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

Authors: Vanessa Borst, Samuel Kounev Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13044v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

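Scoring rationale: The paper studies the use of general-purpose vision models for medical image segmentation and is unrelated to most of the large-model keywords (such as LLM, MoE, and RLHF). The only strongly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because the paper is clearly an application of AI to the biomedical domain. 'Mechanistic Interpretability OR Explainable AI' (5 points) is weakly related, because the paper includes a Grad-CAM visualization analysis, but this is not its core content. No other keyword is directly related.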
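
The pass mark of 26.6 in the score line follows from the formula in the report's configuration section: 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0. The sketch below reproduces that arithmetic. It assumes that each row's score is weight × relevance, that the total is the sum of the row scores, and that a paper passes when the total reaches the pass mark. These rules are inferred from the table columns and are not stated explicitly in the report, and the function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of per-keyword scores; each row is a (weight, relevance) pair.

    Assumption: row score = weight * relevance, inferred from the table columns.
    """
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords, each configured with weight 1.0.
threshold = pass_mark(27 * 1.0)

# Rows mirroring the table above: the two non-zero relevances carry weight 0.0.
rows = [(0.0, 5.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25
score = total_score(rows)
passed = score >= threshold  # assumed pass rule
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under these assumptions a weight of 0.0 zeroes every row score, which matches the 0.0 total shown even though two keywords have non-zero relevance.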

!!! tip deepseek-chat TL;DR

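Through a cross-dataset empirical analysis, this study finds that general-purpose vision models outperform most specialized medical segmentation models on 2D medical image segmentation, and that their Grad-CAM explanations capture clinically relevant structures.

Abstract translation (Chinese)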

医学图像分割是计算机辅助诊断与临床决策支持系统的基础组成部分。过去十年中，涌现出许多针对医学影像特点专门设计的架构，以应对低对比度、微小解剖结构和标注数据有限等特定领域挑战。与此同时，计算机视觉领域的快速发展催生了原本为自然图像设计的高性能通用视觉模型。尽管这些模型在标准视觉基准测试中表现优异，但其在医学图像分割任务中的有效性尚未得到充分理解。本研究通过受控实证实验，系统探究针对二维医学图像分割任务，专用医学分割架构是否相较于现代通用视觉模型具有系统性优势。我们采用统一的训练与评估协议，对十一种专用医学分割架构和通用视觉模型进行比较。实验在三个异构数据集上进行，涵盖不同成像模态、类别结构和数据特征。除分割精度外，我们还通过Grad-CAM可视化定性分析，探究模型的可解释性表现。实验结果表明，在所分析的数据集上，通用视觉模型的表现优于大多数专用医学分割模型。此外，可解释性分析显示，通用视觉模型无需显式的领域专用架构设计即可捕捉临床相关结构。这些发现表明通用视觉模型可作为领域专用方法的可行替代方案，同时凸显了端到端医学图像分割系统中模型选择策略的重要性。所有代码与资源均已发布于GitHub平台。

Abstract

Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs out-perform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available at GitHub.

Keywords: medical image segmentation, general-purpose vision models, cross-dataset study, Grad-CAM, explainable AI, computer-assisted diagnosis, model comparison, clinical relevance

43. ❌ Interrogating Design Homogenization in Web Vibe Coding

Authors: Donghoon Shin, Alice Gao, Rock Yuren Pang, Jaewook Lee, Katharina Reinecke, Emily Tseng Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

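Scoring rationale: The paper studies the use of LLMs for web design (vibe coding) and the resulting risk of homogenization. It is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because its core concern is the impact of LLMs in a specific application setting. The other keywords cover specific techniques (such as MoE, quantization, and inference acceleration) or specific application domains (such as bioinformatics) that the paper does not address, so they all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper examines the risk that LLM-based web vibe coding homogenizes design. Through a lifecycle characterization and a sociotechnical risk analysis, it shows that frictionless generation can exacerbate homogenization, and it proposes a mitigation framework centered on productive friction to preserve design diversity.

Abstract translation (Chinese)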

生成式人工智能以其同质化倾向而著称，常会复现训练数据中的主流风格惯例。然而，这些同质化效应如何延伸至网页设计这类复杂结构性任务，目前尚不明确。随着非专业创作者日益转向使用大语言模型进行“氛围编码”——即通过提示词描述美学与功能目标而非直接编写代码来生成网站——他们可能在无意中缩小了设计的多样性，并限制了整个互联网的创造性表达。本文探讨了网页氛围编码中设计同质化的可能性。我们首先描述了氛围编码的生命周期，指出了同质化风险可能出现的阶段。随后，我们进行了一项社会技术风险分析，剖析了网页氛围编码的潜在危害及其与设计同质化的相互作用。我们发现，对“无摩擦生成”的追求可能加剧同质化及其危害。最后，我们提出了一个以“生产性摩擦”理念为核心的缓解框架。通过微观、中观和宏观层面的案例研究，我们展示了聚焦生产性摩擦如何能够赋能创作者，使其能够挑战默认输出，并在人工智能介导的网页设计中保持表达的多样性。

Abstract

Generative AI is known for its tendency to homogenize, often reproducing dominant style conventions found in training data. However, it remains unclear how these homogenizing effects extend to complex structural tasks like web design. As lay creators increasingly turn to LLMs to ‘vibe-code’ websites – prompting for aesthetic and functional goals rather than writing code – they may inadvertently narrow the diversity of their designs, and limit creative expression throughout the internet. In this paper, we interrogate the possibility of design homogenization in web vibe coding. We first characterize the vibe coding lifecycle, pinpointing stages where homogenization risks may arise. We then conduct a sociotechnical risk analysis unpacking the potential harms of web vibe coding and their interaction with design homogenization. We identify that the push for frictionless generation can exacerbate homogenization and its harms. Finally, we propose a mitigation framework centered on the idea of productive friction. Through case studies at the micro, meso, and macro levels, we show how centering productive friction can empower creators to challenge default outputs and preserve diverse expression in AI-mediated web design.

Keywords: Generative AI, LLMs, web design, vibe coding, design homogenization, sociotechnical risk, productive friction, creative expression

44. ❌ Purify Once, Edit Freely: Breaking Image Protections under Model Mismatch

Authors: Qichen Zhao, Shengfang Zhai, Xinjian Bai, Qingni Shen, Qiqi Lin, Yansong Gao, Zhonghai Wu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

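Scoring rationale: The paper studies image protection and purification for diffusion models, focusing on adversarial perturbations, image editing, and model-mismatch scenarios. It does not involve large language models, innovations in the technical principles of deep learning, or scientific applications, and it has no direct relation to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper studies how fragile image protection methods for diffusion models are under model mismatch. It proposes two purification methods that need no access to the protection mechanism, achieves a "purify once, edit freely" attack, and exposes the security risks that existing protection schemes face against heterogeneous attackers.

Abstract translation (Chinese)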

扩散模型能够实现高保真度的图像编辑，但也可能被滥用于未经授权的风格模仿和有害内容生成。为降低这些风险，主动式图像保护方法在分享前向图像中嵌入微小且通常难以察觉的对抗性扰动，以干扰下游编辑或微调过程。然而，在实际发布后的场景中，内容所有者无法控制下游处理流程，且针对代理模型优化的保护措施在攻击者使用不匹配的扩散流程时可能失效。现有的净化方法虽可能削弱保护效果，但常以牺牲图像质量为代价，且很少考察架构不匹配的影响。本文引入一个统一的发布后净化框架，以评估模型不匹配情况下保护的存活性。我们提出了两种实用的净化器：VAE-Trans通过潜在空间投影校正受保护图像；EditorClean则利用扩散变换器进行指令引导重建，以发挥架构异质性优势。两者均无需访问受保护图像或防御机制内部信息。在2100项编辑任务和六种代表性保护方法中，EditorClean能持续恢复图像可编辑性。与受保护输入相比，其在后续编辑中将PSNR提升3-6 dB，FID降低50-70%，同时以约2 dB的PSNR优势和30%的FID降幅优于现有净化基线。我们的研究揭示了一种“一次净化，自由编辑”的失效模式：一旦净化成功，保护信号即被大幅消除，从而实现无限制编辑。这凸显了在模型不匹配条件下评估保护措施、并设计对异构攻击者具有鲁棒性的防御机制的必要性。

Abstract

Diffusion models enable high-fidelity image editing but can also be misused for unauthorized style imitation and harmful content generation. To mitigate these risks, proactive image protection methods embed small, often imperceptible adversarial perturbations into images before sharing to disrupt downstream editing or fine-tuning. However, in realistic post-release scenarios, content owners cannot control downstream processing pipelines, and protections optimized for a surrogate model may fail when attackers use mismatched diffusion pipelines. Existing purification methods can weaken protections but often sacrifice image quality and rarely examine architectural mismatch. We introduce a unified post-release purification framework to evaluate protection survivability under model mismatch. We propose two practical purifiers: VAE-Trans, which corrects protected images via latent-space projection, and EditorClean, which performs instruction-guided reconstruction with a Diffusion Transformer to exploit architectural heterogeneity. Both operate without access to protected images or defense internals. Across 2,100 editing tasks and six representative protection methods, EditorClean consistently restores editability. Compared to protected inputs, it improves PSNR by 3-6 dB and reduces FID by 50-70 percent on downstream edits, while outperforming prior purification baselines by about 2 dB PSNR and 30 percent lower FID. Our results reveal a purify-once, edit-freely failure mode: once purification succeeds, the protective signal is largely removed, enabling unrestricted editing. This highlights the need to evaluate protections under model mismatch and design defenses robust to heterogeneous attackers.

Keywords: diffusion models, image protection, adversarial perturbations, model mismatch, purification framework, image editing, VAE-Trans, EditorClean
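
The PSNR gains quoted in this entry are expressed in decibels. The following is a minimal sketch of the usual definition, PSNR = 10·log10(peak² / MSE), assuming 8-bit images flattened to sequences of pixel values with a peak of 255. The function name and the pixel data are illustrative and are not taken from the paper.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical pixel values; halving the mean squared error adds about 3 dB.
reference = [0, 0, 0, 0]
noisier = [2, 2, 2, 2]   # MSE = 4
cleaner = [2, 2, 0, 0]   # MSE = 2
gain_db = psnr(reference, cleaner) - psnr(reference, noisier)
print(f"gain: {gain_db:.2f} dB")
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error, and a gain of 6 dB to roughly quartering it.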

45. ❌ SortScrews: A Dataset and Baseline for Real-time Screw Classification

Authors: Tianhao Fu, Bingxuan Yang, Juncheng Guo, Shrena Sribalan, Yucheng Chen Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13027v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

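Scoring rationale: The paper focuses on creating and benchmarking a screw classification dataset in computer vision, using conventional convolutional neural networks (EfficientNet-B0 and ResNet-18) for image classification. It does not involve large language models, innovations in the technical principles of deep learning, applications of large models in different domains, or any technique named in the scoring keywords (such as MoE, quantization, alignment, or RAG). No keyword relates to the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper introduces SortScrews, a public dataset for visual screw classification, and establishes baseline classification results with ImageNet-pretrained EfficientNet-B0 and ResNet-18 models, achieving strong classification accuracy despite the limited amount of data.

Abstract translation (Chinese)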

螺丝类型的自动识别对于工业自动化、机器人技术和库存管理具有重要意义。然而，目前公开可用的螺丝分类数据集十分稀缺，特别是在自动化分拣系统中常见的受控单物体场景方面。本研究引入了SortScrews数据集，用于螺丝的按类别视觉分类。该数据集包含560张分辨率为$512\times512$的RGB图像，涵盖六种螺丝类型及一个背景类别。图像采用标准化采集装置获取，并在四种采集设置下包含了光照与相机视角的轻微变化。

为促进可重复研究及数据集扩展，我们还提供了一个可复用的数据采集脚本，用户可利用低成本相机设备，轻松为自定义硬件组件构建类似数据集。

我们使用在ImageNet上预训练的EfficientNet-B0和ResNet-18分类器进行迁移学习，建立了基准结果。此外，我们还进行了深入的故障分析。尽管数据集规模有限，这些轻量级模型仍实现了较高的分类准确率，表明在受控采集条件下，即使使用相对较小的数据集也能实现有效学习。本数据集、采集流程及基准训练代码已公开于https://github.com/ATATC/SortScrews。

Abstract

Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce SortScrews, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at $512\times512$ resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a well-explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at https://github.com/ATATC/SortScrews.

Keywords: screw classification, dataset, computer vision, transfer learning, EfficientNet, ResNet, industrial automation, baseline model

46. ❌ SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

Authors: Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu, Mathias Unberath Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

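Scoring rationale: SAW focuses on a world model for surgical video generation. It is highly relevant to 'World Models AND General World Models' (10 points), because its core contribution is a surgical action world model. It has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), because it applies AI to the biomedical (surgical) domain, although it does not directly involve bioinformatics or cheminformatics. No other keyword relates to the paper (0 points): it does not address large models, the technical principles of deep learning, training methods, inference optimization, or agent systems, and instead concentrates on applying computer vision and video generation to surgery.

!!! tip deepseek-chat TL;DR

The paper proposes SAW, a surgical action world model that uses conditional video diffusion to generate controllable, realistic surgical action videos. It targets data scarcity and the sim-to-real gap in surgical AI and simulation, and demonstrates practical value for action recognition and surgical simulation.

Abstract translation (Chinese)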

一种能够生成逼真手术操作视频并精确控制器械-组织交互的手术世界模型，可应对外科人工智能与仿真领域的核心挑战——从数据稀缺与罕见事件合成，到弥合手术自动化中仿真与现实的鸿沟。然而，作为此类手术世界模型核心的现有视频生成方法，在推理时需要昂贵的标注或复杂的结构化中间表征作为条件信号，限制了其可扩展性。其他方法则在复杂的腹腔镜场景中表现出有限的时间一致性，且缺乏足够的真实感。我们提出 Surgical Action World (SAW)——一种通过视频扩散模型向手术动作世界建模迈进的尝试，该模型以四种轻量级信号为条件：编码器械-动作上下文的语言提示、参考手术场景、组织可供性掩码以及二维器械尖端轨迹。我们设计了一种条件视频扩散方法，将视频到视频的扩散重新构建为以轨迹为条件的手术动作合成。骨干扩散模型在一个包含12,044个腹腔镜片段的定制数据集上进行了微调，该数据集带有轻量级时空条件信号，并利用深度一致性损失来增强几何合理性，而无需在推理时使用深度信息。SAW在预留测试数据上实现了最先进的时间一致性（CD-FVD：199.19 对比 546.82）和强大的视觉质量。此外，我们展示了其在以下两个下游任务中的应用价值：（a）外科人工智能领域，使用SAW生成的视频增强罕见动作数据，可在真实测试数据上提升动作识别性能（夹闭动作F1分数：从20.93%提升至43.14%；切割动作：从0.00%提升至8.33%）；（b）手术仿真领域，从仿真器导出的轨迹点渲染器械-组织交互视频，为构建视觉逼真的仿真引擎指明了方向。

Abstract

A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation – from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) – a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.

Keywords: surgical world model, video generation, diffusion model, surgical AI, tool-tissue interaction, temporal consistency, data augmentation, surgical simulation
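
The action-recognition gains in this entry are reported as per-class F1-scores. The following is a minimal sketch of how such a score is derived from true positives, false positives, and false negatives. The function name and the counts are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for one rare action class.
print(f1_score(tp=0, fp=3, fn=5))  # no correct detections at all
print(f1_score(tp=1, fp=1, fn=1))  # one hit, one false alarm, one miss
```

An F1-score of 0.00%, as reported for the cutting action before augmentation, means the classifier produced no true positives for that class, so any correct detection after augmentation raises the score above zero.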

47. ❌ ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning

Authors: Bangjun Xiao, Yihao Zhao, Xiangwei Deng, Shihua Yu, Yuxing Xiang, Huaqiu Liu, Qiying Wang, Liang Zhao, Hailin Zhang, Xuanzhe Liu, Xin Jin, Fuli Luo Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13019v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

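Scoring rationale: The paper's core contribution is ARL-Tangram, a resource management system for agentic reinforcement learning (agentic RL) that improves how efficiently LLMs use external resources during agentic RL. The paper explicitly discusses LLMs and agentic RL, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points each). The other keywords, such as MoE, SLMs, scaling laws, the various training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, interpretability, and AI for science, are not addressed or mentioned, so they receive 0 points.

!!! tip deepseek-chat TL;DR

To address inefficient use of external resources in agentic reinforcement learning, the paper proposes ARL-Tangram, a resource management system that uses action-level orchestration and elastic scheduling to improve average action completion time by up to 4.3×, shorten RL training step duration by up to 1.5×, and save up to 71.2% of external resources.

Abstract translation (Chinese)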

智能体强化学习已成为云集群中的变革性工作负载，它使大语言模型能够通过与现实世界交互来解决复杂问题。然而，与传统强化学习不同，智能体强化学习需要消耗大量位于主训练集群之外的外部云资源，例如用于代码执行的CPU和用于奖励模型的GPU。现有的智能体强化学习框架通常依赖静态超量供应，即资源往往与长生命周期的轨迹绑定或被任务隔离，这导致了严重的资源低效问题。

我们提出了动作级编排机制，并将其整合到ARL-Tangram中——这是一个支持细粒度外部资源共享与弹性的统一资源管理系统。ARL-Tangram采用统一动作级建模框架和弹性调度算法，在满足异构资源约束的同时最小化动作完成时间。此外，系统定制了异构资源管理器，以高效支持在具有异构特性与拓扑结构的资源上执行动作级任务。在实际智能体强化学习任务上的评估表明，ARL-Tangram将平均动作完成时间最高提升4.3倍，将强化训练的单步时长最高加速1.5倍，并节省高达71.2%的外部资源。该系统已部署应用于支持MiMo系列模型的训练。

Abstract

Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency. We propose action-level orchestration and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support the action-level execution on resources with heterogeneous characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3×, speeds up the step duration of RL training by up to 1.5×, and saves external resources by up to 71.2%. This system has been deployed to support the training of the MiMo series models.

Keywords: Agentic Reinforcement Learning, Resource Management, Large Language Models, Action-level Orchestration, Resource Efficiency, External Resource Sharing, Elastic Scheduling, MiMo Series Models

48. ❌ daVinci-Env: Open SWE Environment Synthesis at Scale

Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心是构建大规模软件工程（SWE）代理训练环境，与LLM代理、工具使用和多代理系统高度相关（10分），因为论文明确涉及SWE代理训练、多代理合成管道和代理协调。与基础LLM相关（8分），因为训练了OpenSWE-32B/72B模型。与监督微调相关（8分），因为涉及SWE-focused training。与数据质量和扩展定律有一定关联（5分），因为论文强调质量过滤和规模。与AI for Science相关（5分），因为SWE可视为科学计算领域应用。其他关键词如MoE、量化、推理加速等未涉及（0分）。

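!!! tip deepseek-chat TL;DR

The paper presents the OpenSWE framework, which builds large-scale executable training environments for software engineering agents. Using a multi-agent synthesis pipeline and quality filtering, the resulting models reach SOTA among Qwen2.5-series models on SWE-bench Verified and show substantial out-of-domain gains on mathematical reasoning and science benchmarks.

Abstract (Chinese translation)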

训练具备能力的软件工程（SWE）智能体需要大规模、可执行且可验证的环境，这些环境应提供动态反馈循环以支持迭代式代码编辑、测试执行和解决方案优化。然而，现有的开源数据集在规模和仓库多样性方面仍显不足，而工业解决方案则因其基础设施未公开而缺乏透明度，这为大多数学术研究团队设置了难以逾越的障碍。我们提出了OpenSWE，这是目前规模最大、完全透明的Python软件工程智能体训练框架，包含45,320个可执行的Docker环境，覆盖超过12.8k个代码仓库，所有Dockerfile、评估脚本和基础设施均已开源以确保可复现性。OpenSWE通过部署在64节点分布式集群上的多智能体合成流水线构建而成，实现了仓库探索、Dockerfile构建、评估脚本生成和迭代式测试分析的自动化。除了规模优势，我们还提出了一种以质量为核心的过滤流水线，用于刻画每个环境的内在难度，滤除不可解或挑战性不足的实例，仅保留那些能最大化学习效率的环境。该项目在环境构建上投入了89.1万美元，并在轨迹采样与难度感知筛选上额外投入了57.6万美元，总投资约147万美元，最终从约9,000个质量有保障的环境中产出约13,000条精选轨迹。大量实验验证了OpenSWE的有效性：OpenSWE-32B和OpenSWE-72B在SWE-bench Verified基准上分别达到62.4%和66.0%的准确率，在Qwen2.5系列模型中创下最优性能。此外，专注于软件工程的训练还带来了显著的跨领域能力提升，包括在数学推理任务上最高提升12个百分点，在科学基准上提升5个百分点，且未损害事实召回能力。

Abstract

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE’s effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
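
The difficulty-aware filtering idea can be sketched as follows. This is a minimal illustration, not OpenSWE's actual pipeline: it assumes difficulty is estimated from the empirical solve rate over sampled trajectories, and the thresholds `min_rate` and `max_rate` as well as the function name are hypothetical.

```python
def filter_environments(solve_counts, min_rate=0.0, max_rate=0.8):
    """Keep environments that are neither unsolved nor too easy.

    solve_counts maps an environment id to (solved, attempts).
    An environment is kept when min_rate < solve rate <= max_rate.
    Both thresholds are illustrative assumptions.
    """
    kept = []
    for env_id, (solved, attempts) in solve_counts.items():
        if attempts == 0:
            continue  # no trajectories sampled, difficulty unknown
        rate = solved / attempts
        if min_rate < rate <= max_rate:
            kept.append(env_id)
    return kept


samples = {
    "env-a": (0, 8),  # never solved: treated as unsolvable
    "env-b": (3, 8),  # solved sometimes: kept
    "env-c": (8, 8),  # always solved: treated as insufficiently challenging
}
print(filter_environments(samples))
```

A band on the solve rate is only one plausible way to express "unsolvable or insufficiently challenging"; the paper's own difficulty characterization may differ.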

Keywords: software engineering agents, SWE agent training, multi-agent synthesis pipeline, Docker environments, quality-centric filtering, trajectory sampling, OpenSWE framework, out-of-domain improvements

49. ❌ Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Authors: Sydney Lewis Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

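Scoring rationale: The paper studies compressed storage and retrieval of AI agent conversation history. It is highly relevant to LLM Agents and Retrieval-Augmented Generation (10 points each) because it directly studies retrieval over agent memory. It is relevant to Large Language Models and Context Window Extension (8 points each) because it uses LLMs as graders and addresses context-cost reduction. Other keywords such as MoE, SFT, and RLHF are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how structured distillation can compress an AI agent's personalized conversation memory, achieving 11x compression while retaining 96% of the best verbatim retrieval MRR, which addresses the high cost of carrying long conversation histories.

Abstract (Chinese translation)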

与AI智能体的长对话为用户带来一个简单问题：对话历史具有实用价值，但逐字记录会消耗过高成本。本研究聚焦个性化智能体记忆系统：将单用户与智能体的对话历史提炼为紧凑的可检索层以供后续搜索。每次对话交换被压缩为包含四个字段的复合对象（对话核心、具体情境、主题房间分配、正则表达式提取的文件操作记录）。经提炼的可检索文本平均每轮对话仅需38个词元。该方法应用于6个软件工程项目的4,182段对话（14,340次交换），将平均交换长度从371词元压缩至38词元，实现11倍压缩率。我们通过201个面向记忆检索的查询、涵盖5种纯检索模式与5种跨层检索模式的107种配置方案，以及5个大型语言模型评分器（共214,519组共识评级的查询-结果对），评估个性化记忆检索在压缩过程中的有效性。最佳纯提炼配置达到最佳逐字检索MRR的96%（0.717对比0.745）。结果呈现机制依赖性：经邦费罗尼校正后，全部20种向量检索配置均未出现显著差异，而全部20种BM25配置均显著退化（效应量|d|=0.031-0.756）。最佳跨层检索配置略优于最佳纯逐字基线（MRR 0.759）。结构化提炼方法可在不全面牺牲检索质量的前提下压缩单用户智能体记忆。仅需1/11的上下文成本，数千次对话交换即可容纳于单个提示中，同时保留逐字原始记录供深度追溯。我们已将实现代码与分析流程作为开源软件发布。

Abstract

Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user’s conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
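
The four-field compound object and the MRR metric named in the abstract can be sketched as follows. The field names come from the abstract; everything else is an assumption for illustration, in particular the file-path regular expression `FILE_PATTERN` and the helper names, which are not taken from the released implementation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern: the abstract only says files are "regex-extracted".
FILE_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|md|json|toml|yaml)\b")


@dataclass
class DistilledExchange:
    """One compressed exchange with the four fields listed in the abstract."""
    exchange_core: str
    specific_context: str
    room_assignments: list
    files_touched: list = field(default_factory=list)


def extract_files(text):
    """Return the sorted, de-duplicated file paths mentioned in text."""
    return sorted(set(FILE_PATTERN.findall(text)))


def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries; each entry is the 1-based rank of the first
    relevant result, or None when nothing relevant was retrieved."""
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


raw = "Edited src/app.py and README.md today."
record = DistilledExchange(
    exchange_core="Fixed the startup crash.",
    specific_context="Crash was caused by a missing config key.",
    room_assignments=["debugging"],
    files_touched=extract_files(raw),
)
print(record.files_touched)
print(mean_reciprocal_rank([1, 2, None]))
print(round(0.717 / 0.745, 2))  # ratio of distilled to verbatim MRR from the abstract
```

The last line reproduces the 96% figure quoted in the abstract from the two reported MRR values.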

Keywords: personalized agent memory, structured distillation, retrieval preservation, context compression, vector search, BM25, LLM grading, software engineering

50. ❌ Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Authors: Boxuan Lyu, Haiyue Song, Zhi Qu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12983v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

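Scoring rationale: The paper proposes an iterative distillation framework based on Minimum Bayes Risk (MBR) decoding for error span detection in machine translation; its core idea is to replace human annotation with pseudo-labels generated by an off-the-shelf LLM. It is therefore relevant to 'Large Language Models' (8 points), because the paper explicitly uses an LLM to generate pseudo-labels; to 'Post-training/SFT' (8 points), because models are fine-tuned on the pseudo-labels; and to some extent to 'Self-Correction/Self-Improvement' (5 points), because the framework involves a self-evolution process. Other keywords such as MoE, SLMs, Scaling Laws, and RAG are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes an iterative distillation framework based on Minimum Bayes Risk decoding for error span detection in machine translation, using pseudo-labels from an off-the-shelf LLM in place of human annotation. Experiments show that models trained only on self-generated pseudo-labels outperform supervised baselines trained on human annotations at the system and span levels.

Abstract (Chinese translation)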

错误跨度检测（Error Span Detection，ESD）是机器翻译（Machine Translation，MT）评估中的关键子任务，旨在识别翻译错误的位置与严重程度。尽管基于人工标注数据微调模型可提升ESD性能，但获取此类数据成本高昂且易受标注者间不一致性影响。为此，我们提出一种基于最小贝叶斯风险（Minimum Bayes Risk，MBR）解码的新型自进化框架——迭代MBR蒸馏ESD方法，通过利用现成的大语言模型生成伪标签，彻底摆脱了对人工标注的依赖。在WMT Metrics Shared Task数据集上的大量实验表明，仅基于这些自生成伪标签训练的模型，在系统层面和跨度层面的表现均优于未经适配的基础模型及基于人工标注训练的监督基线，同时在句子层面保持了具有竞争力的性能。

Abstract

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
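
MBR decoding in general picks, from a set of sampled candidates, the one with the highest average utility against the other candidates. The sketch below applies that idea to sets of error spans. The utility function (exact-match F1 over `(start, end, severity)` tuples) and the function names are assumptions for illustration; the paper's actual utility and iteration scheme are not given in the abstract.

```python
def span_f1(a, b):
    """F1 overlap between two collections of (start, end, severity) spans."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty annotations agree perfectly
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=span_f1):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], other) for other in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


sampled = [
    [(0, 3, "major")],
    [(0, 3, "major"), (5, 7, "minor")],
    [(0, 3, "major")],
    [(9, 11, "minor")],
]
print(mbr_select(sampled))
```

In a distillation loop, the selected annotation would serve as the pseudo-label for the next round of fine-tuning; that loop is not shown here.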

Keywords: Error Span Detection, Machine Translation, Minimum Bayes Risk Decoding, Iterative Distillation, LLM-generated Pseudo-labels, Human Annotation Elimination, Self-evolution Framework, WMT Metrics Shared Task

51. ❌ Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

Authors: Aditya Parikh, Aasa Feragen Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

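Scoring rationale: The paper focuses on fair diagnosis from medical imaging (chest CT), using an attention-based multiple instance learning framework on a convolutional backbone (ConvNeXt) with a gradient reversal layer to suppress gender bias. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is relevant, because the paper applies AI to biomedicine (specifically medical image diagnosis), which falls under "AI for Science". All other keywords concern large language models (LLMs) and related techniques (MoE, fine-tuning, inference optimization, agents, and so on); the paper neither uses nor mentions any LLM technique, so they are scored 0.

!!! tip deepseek-chat TL;DR

The study proposes a fairness framework based on attention multiple instance learning and a gradient reversal layer for diagnosing multiple lung diseases from chest CT scans. It targets sparse pathological signal and demographic imbalance, and achieves a mean validation score of 0.685 in the Fair Disease Diagnosis Challenge.

Abstract (Chinese translation)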

我们提出了一种用于胸部CT容积多类肺部疾病诊断的公平性感知框架，该框架为PHAROS-AIF-MIH研讨会（CVPR 2026）的公平疾病诊断挑战赛而开发。该挑战赛要求将CT扫描分类为四个类别——健康、COVID-19、腺癌和鳞状细胞癌——其性能以按性别划分的宏观F1分数的平均值来衡量，明确惩罚性别不平等的预测。我们的方法解决了两个核心难点：跨越数百个切片的稀疏病理信号，以及在疾病类别和性别维度上叠加的严重人口统计学不平衡。我们提出了一种基于ConvNeXt骨干网络的注意力多示例学习模型，该模型能够在无需切片级监督的情况下学习识别具有诊断相关性的切片，并通过梯度反转层进行增强，以对抗性方式抑制学习到的扫描表征中可预测性别的结构。训练过程结合了带标签平滑的焦点损失、基于联合（类别，性别）分层的分层交叉验证，以及对最少数代表亚群进行针对性过采样。在推理阶段，所有五折交叉验证的检查点通过软逻辑投票和折外阈值优化进行集成，并结合水平翻转测试时数据增强以提高鲁棒性。我们的模型实现了0.685（标准差0.030）的平均验证竞赛分数，最佳单折分数达到0.759。所有训练和推理代码已在https://github.com/ADE-17/cvpr-fair-chest-ct 公开。

Abstract

We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories – Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma – with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-the-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std - 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE-17/cvpr-fair-chest-ct
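
The competition metric described in the abstract, the average of per-gender macro F1 scores, can be sketched in plain Python. The handling of classes that are absent from a gender group (counted as F1 = 0 here) is an assumption; the challenge's official scoring code may treat that case differently, and the function names are hypothetical.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def per_gender_macro_f1(y_true, y_pred, genders, classes):
    """Average of the macro F1 computed separately for each gender group."""
    group_scores = []
    for g in sorted(set(genders)):
        idx = [i for i, value in enumerate(genders) if value == g]
        group_scores.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], classes)
        )
    return sum(group_scores) / len(group_scores)


# Toy two-class example: perfect on group F, one error on group M.
score = per_gender_macro_f1(
    y_true=[0, 1, 0, 1],
    y_pred=[0, 1, 1, 1],
    genders=["F", "F", "M", "M"],
    classes=[0, 1],
)
print(score)
```

Because each group contributes equally to the final average, a model that performs well on the majority gender but poorly on the minority one is penalized more than it would be under a pooled macro F1.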

Keywords: fairness-aware framework, lung disease diagnosis, chest CT volumes, attention-based Multiple Instance Learning, Gradient Reversal Layer, gender-adversarial, demographic imbalance, focal loss

52. ❌ Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization

Authors: Raphael Trumpp, Denis Hoornaert, Mirco Theile, Marco Caccamo Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12960v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

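Scoring rationale: The paper focuses on deep reinforcement learning (DRL) for robot control, specifically an improved residual policy optimization method for autonomous racing. All scoring keywords concern large language models (LLMs), the technical principles of large models, or AI for Science applications, whereas this paper applies conventional DRL to robot control and involves no large-model technique, language model, scientific AI application, or related method. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes attenuated residual policy optimization (α-RPO), which addresses the system complexity and inference latency of controllers based on residual policy learning in autonomous racing, and achieves improved performance and zero-shot transfer in simulation and on a real robot platform.

Abstract (Chinese translation)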

残差策略学习（Residual Policy Learning，RPL）通过深度强化学习（DRL）使习得的策略能够优化静态基础策略，已在多种机器人应用中展现出卓越性能。其在自动驾驶赛车领域的成效尤为显著，该领域被视为现实世界DRL应用的一个具有挑战性的基准。然而，部署基于RPL的控制器会引入系统复杂性并增加推理延迟。为此，我们提出了一种名为衰减残差策略优化（$α$-RPO）的RPL扩展方法。与标准RPL不同，$α$-RPO通过逐步衰减基础策略（该策略在初始阶段用于引导学习）来生成一个独立的神经策略。此外，这种机制实现了一种特权学习形式，允许基础策略使用最终部署时非必需的传感器模态。我们将$α$-RPO设计为可与PPO（近端策略优化）无缝集成，确保在策略优化过程中动态补偿基础控制器衰减的影响。我们围绕$α$-RPO构建了一个1:10比例自动驾驶赛车框架进行评估。在仿真和零样本迁移至Roboracer赛车的真实场景中，与基线方法相比，$α$-RPO不仅降低了系统复杂度，还提升了驾驶性能——这证明了其在机器人部署中的实用性。我们的代码发布于：https://github.com/raphajaner/arpo_racing。

Abstract

Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ($α$-RPO). Unlike standard RPL, $α$-RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design $α$-RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate $α$-RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, $α$-RPO not only reduces system complexity but also improves driving performance compared to baselines - demonstrating its practicality for robotic deployment. Our code is available at: https://github.com/raphajaner/arpo_racing.
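
The attenuation idea can be illustrated with a toy sketch. It assumes the executed action is the learned residual plus the base action scaled by a factor α that decays from 1 to 0 during training, and it assumes a linear decay. Both the schedule and the exact combination rule are illustrative assumptions: the abstract states only that the base policy is progressively attenuated, and the PPO-side compensation it mentions is not modeled here.

```python
def attenuation(step, total_steps):
    """Assumed linear schedule: alpha goes from 1.0 at step 0 to 0.0 at total_steps."""
    if total_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / total_steps)


def combined_action(base_action, residual_action, alpha):
    """Blend a base controller's action with the learned policy's output.

    With alpha = 1 this is standard residual policy learning (base + residual);
    with alpha = 0 the learned policy acts alone.
    """
    return [alpha * b + r for b, r in zip(base_action, residual_action)]


base = [1.0, -2.0]      # e.g. steering and throttle from a classical controller
learned = [0.5, 0.5]    # output of the neural policy
for step in (0, 50, 100):
    alpha = attenuation(step, total_steps=100)
    print(step, alpha, combined_action(base, learned, alpha))
```

Once α reaches 0 the base controller no longer contributes, which is what allows the learned policy to be deployed on its own.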

Keywords: autonomous racing, residual policy learning, deep reinforcement learning, attenuated residual policy optimization, robotic deployment, zero-shot transfer, PPO, real-world DRL

53. ❌ Delta1 with LLM: symbolic and neural integration for credible and explainable reasoning

Authors: Yang Xu, Jun Liu, Shuwei Chen, Chris Nugent, Hailing Guo Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

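Scoring rationale: The paper's core is neuro-symbolic reasoning: it combines the Delta1 automated theorem generator with LLMs to achieve explainable reasoning. Highly relevant keywords: LLMs (a core component) and Explainable AI (the core goal). Moderately relevant: Chain of Thought/System 2 Thinking (reasoning processes are involved), Factuality (ensuring credibility), and AI for Science (applications in health care and other domains). The remaining keywords match neither the technical details nor the application scenarios.

!!! tip deepseek-chat TL;DR

The study proposes a neuro-symbolic reasoning framework that combines the Delta1 automated theorem generator with large language models, enabling interpretable, auditable, domain-aligned reasoning, and validates it in health care, compliance, and other domains.

Abstract (Chinese translation)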

神经符号推理日益需要将逻辑的形式严谨性与大型语言模型（LLM）的可解释性相结合的框架。我们提出一种“构造即解释”的端到端流程，该流程将基于完全三角标准矛盾（FTSC）的自动定理生成器Delta1与LLM相集成。Delta1以多项式时间确定性构造最小不可满足子句集和完备定理，通过构造过程同时保证了可靠性与极小性。LLM层将每个定理及其证明轨迹转化为连贯的自然语言解释与可操作的洞见。在医疗保健、合规及监管等领域的实证研究表明，Delta1与LLM的结合能够实现可解释、可审计且与领域对齐的推理。此项工作推动了逻辑、语言与学习的融合，将构造性定理生成确立为神经符号可解释人工智能的一个原则性基础。

Abstract

Neuro-symbolic reasoning increasingly demands frameworks that unite the formal rigor of logic with the interpretability of large language models (LLMs). We introduce an end-to-end explainability-by-construction pipeline integrating the Automated Theorem Generator Delta1, based on the full triangular standard contradiction (FTSC), with LLMs. Delta1 deterministically constructs minimal unsatisfiable clause sets and complete theorems in polynomial time, ensuring both soundness and minimality by construction. The LLM layer verbalizes each theorem and proof trace into coherent natural language explanations and actionable insights. Empirical studies across health care, compliance, and regulatory domains show that Delta1 and LLM enable interpretable, auditable, and domain-aligned reasoning. This work advances the convergence of logic, language, and learning, positioning constructive theorem generation as a principled foundation for neuro-symbolic explainable AI.
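
To make the notion of a minimal unsatisfiable clause set concrete, the sketch below checks the property by brute force: the set must be unsatisfiable, and removing any single clause must make it satisfiable. This is an exponential-time checker written for illustration only; it is not Delta1's polynomial-time construction, whose details are not given in the abstract. Clauses are lists of non-zero integers in the DIMACS convention (a negative number is a negated variable).

```python
from itertools import product


def is_satisfiable(clauses):
    """Try every truth assignment; fine for a handful of variables only."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


def is_minimal_unsat(clauses):
    """True if clauses are unsatisfiable but every proper subset obtained
    by dropping one clause is satisfiable."""
    if is_satisfiable(clauses):
        return False
    return all(
        is_satisfiable(clauses[:i] + clauses[i + 1:]) for i in range(len(clauses))
    )


# p, (not p or q), not q: contradictory, and every clause is needed.
print(is_minimal_unsat([[1], [-1, 2], [-2]]))

# p, not p, q: contradictory, but the clause q is redundant.
print(is_minimal_unsat([[1], [-1], [2]]))
```

The first example corresponds to the theorem "from p and p → q, conclude q": the clause set is unsatisfiable exactly because the negated conclusion contradicts the premises.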

Keywords: neuro-symbolic reasoning, large language models, explainable AI, automated theorem generation, interpretable reasoning, domain alignment, healthcare applications, formal logic

54. ❌ Thinking in Streaming Video

Authors: Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

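Scoring rationale: The paper proposes the ThinkStream framework for streaming video reasoning. Its core involves incremental reasoning (highly relevant to 'Chain of Thought' and 'System 2 Thinking', 10 points each) and interactive assistants/multimodal agents (highly relevant to 'LLM Agents', 10 points), and it implicitly relies on large-model technology (5 points). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the high latency and computational cost of existing video reasoning methods in streaming scenarios, the paper proposes the ThinkStream framework. Using a Watch–Think–Speak paradigm and Reasoning-Compressed Streaming Memory, it achieves low-latency, efficient streaming video understanding and significantly outperforms existing online video models on multiple benchmarks.

Abstract (Chinese translation)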

对连续视频流的实时理解对于在动态环境中运行的交互式助手与多模态智能体至关重要。然而，现有的大多数视频推理方法遵循批量处理范式，即推迟推理直至观察到完整的视频上下文，这导致高延迟和不断增长的计算成本，难以适应流式场景。本文提出ThinkStream，一种基于“观察—思考—表达”范式的流式视频推理框架，使模型能够随着新视频观测的到达而增量更新其理解。在每一步中，模型执行简短推理更新，并判断是否已积累足够证据以生成响应。为支持长时程流式处理，我们提出推理压缩流式记忆（Reasoning-Compressed Streaming Memory, RCSM），该方法将中间推理轨迹视为紧凑的语义记忆，在保留关键上下文的同时替换过时的视觉标记。我们进一步通过可验证奖励的流式强化学习方案训练模型，使增量推理与响应时机与流式交互的要求对齐。在多个流式视频基准测试上的实验表明，ThinkStream在保持低延迟和低内存占用的同时，显著优于现有的在线视频模型。代码、模型与数据将在https://github.com/johncaged/ThinkStream发布。

Abstract

Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
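
The Watch–Think–Speak loop and the idea of keeping reasoning traces while dropping old visual tokens can be sketched as a toy control loop. Everything here is a schematic assumption: `reason` and `should_speak` stand in for the model, the fixed-size window is a hypothetical eviction policy, and the class and function names are invented for illustration. It does not reproduce ThinkStream's actual memory mechanism or training scheme.

```python
from collections import deque


class StreamingMemory:
    """Keeps recent visual chunks plus all reasoning traces."""

    def __init__(self, max_visual_chunks=2):
        # Old visual chunks fall out of the deque automatically.
        self.visual = deque(maxlen=max_visual_chunks)
        # Traces persist and act as compact memory of evicted chunks.
        self.traces = []

    def context(self):
        visual_tokens = [tok for chunk in self.visual for tok in chunk]
        return self.traces + visual_tokens


def run_stream(chunks, reason, should_speak, max_visual_chunks=2):
    memory = StreamingMemory(max_visual_chunks)
    responses = []
    for step, chunk in enumerate(chunks):
        memory.visual.append(chunk)          # Watch
        trace = reason(memory.context())     # Think
        memory.traces.append(trace)
        if should_speak(trace):              # Speak only when evidence suffices
            responses.append((step, trace))
    return memory, responses


chunks = [["a", "b"], ["c", "goal"], ["d", "e"], ["f", "g"]]
memory, responses = run_stream(
    chunks,
    reason=lambda ctx: "saw goal" if "goal" in ctx else "nothing",
    should_speak=lambda trace: trace == "saw goal",
)
print(responses)
print(memory.context())
```

After the last step the "goal" token has been evicted from the visual window, but the trace "saw goal" is still part of the context, which is the property the memory design is meant to provide.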

Keywords: streaming video reasoning, incremental reasoning, multimodal agents, low latency, Reasoning-Compressed Streaming Memory, Watch-Think-Speak paradigm, interactive assistants, real-time understanding

55. ❌ Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization

Authors: Xudong Wang, Chaoning Zhang, Jiaquan Zhang, Chenghao Li, Qigan Sun, Sung-Ho Bae, Peng Wang, Ning Xie, Jie Zou, Yang Yang, Hengtao Shen Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

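Scoring rationale: The paper studies efficient LLM-based routing in multi-agent systems and is highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 points each). It uses a supervised fine-tuned small model for intent inference, which makes it highly relevant to 'Small Language Models' (8 points) and 'Post-training/SFT' (10 points). The framework provides interpretable routing evidence, which relates to 'Mechanistic Interpretability' (8 points). It touches on tool use and inference acceleration, giving some relevance to 'Tool Use' and 'Inference Acceleration' (5 points each). Other keywords such as MoE, Scaling Laws, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets inefficient routing in multi-agent LLM systems caused by high inference cost, latency, and limited transparency. It proposes the AMRO-S framework, which combines intent inference by an SFT small model, task-specific pheromone specialists, and an asynchronous update mechanism, improving the quality–cost trade-off on multiple benchmarks while providing traceable routing evidence.

Abstract (Chinese translation)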

由大语言模型（LLM）驱动的多智能体系统（MAS）在复杂推理与工具使用方面展现出强大能力，而异构智能体池进一步拓宽了质量-成本的权衡空间。尽管取得了这些进展，实际部署仍常受限于高推理成本、高延迟及有限的可解释性，这阻碍了可扩展且高效的路由机制。现有路由策略通常依赖昂贵的大语言模型选择器或静态策略，在动态负载与混合意图下缺乏对语义感知路由的有效控制，常导致性能不稳定与资源利用效率低下。为应对这些局限，本文提出AMRO-S——一个面向多智能体系统的高效可解释路由框架。AMRO-S将多智能体系统路由建模为语义条件路径选择问题，通过三个关键机制提升路由性能：首先，它利用监督微调的小型语言模型进行意图推断，为每个查询提供低开销的语义接口；其次，它将路由记忆分解为任务特异性信息素专家，减少跨任务干扰并优化混合工作负载下的路径选择；最后，采用质量门控异步更新机制解耦推理与学习过程，在不增加延迟的前提下优化路由。在五个公开基准测试和高并发压力测试中的大量实验表明，AMRO-S相较于强基线路由方法能持续改善质量-成本权衡，同时通过结构化信息素模式提供可追溯的路由证据。

Abstract

Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality–cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for Multi-Agent Systems (MAS). AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality–cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.
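
The routing idea can be sketched with the classic ant colony selection rule, in which the probability of choosing an option is proportional to pheromone^α × heuristic^β, combined with evaporation. The per-task pheromone tables mirror the "task-specific pheromone specialists" in the abstract, and the quality threshold mirrors the "quality-gated" update. The specific formulas, parameter values, and class name are assumptions for illustration; AMRO-S's actual update rules are not given in the abstract, and the asynchronous decoupling of inference from learning is not modeled.

```python
import random


class PheromoneRouter:
    def __init__(self, agents, tasks, evaporation=0.1, quality_gate=0.5, seed=0):
        self.agents = list(agents)
        # One pheromone table per task type, initialised uniformly.
        self.tau = {task: {agent: 1.0 for agent in self.agents} for task in tasks}
        self.evaporation = evaporation
        self.quality_gate = quality_gate
        self.rng = random.Random(seed)

    def probabilities(self, task, cost, alpha=1.0, beta=1.0):
        """Classic ACO rule with heuristic = 1 / cost (cheaper is preferred)."""
        weights = {
            agent: (self.tau[task][agent] ** alpha) * ((1.0 / cost[agent]) ** beta)
            for agent in self.agents
        }
        total = sum(weights.values())
        return {agent: weight / total for agent, weight in weights.items()}

    def route(self, task, cost):
        probs = self.probabilities(task, cost)
        return self.rng.choices(self.agents, weights=[probs[a] for a in self.agents])[0]

    def update(self, task, agent, quality):
        """Evaporate and reinforce, but only for sufficiently good outcomes."""
        if quality < self.quality_gate:
            return
        for a in self.agents:
            self.tau[task][a] *= 1.0 - self.evaporation
        self.tau[task][agent] += quality


router = PheromoneRouter(agents=["small", "large"], tasks=["math", "code"])
cost = {"small": 1.0, "large": 4.0}
before = router.probabilities("math", cost)
router.update("math", "large", quality=0.9)   # accepted by the gate
router.update("math", "small", quality=0.2)   # rejected by the gate
after = router.probabilities("math", cost)
print(before["large"], after["large"])
```

Because the pheromone values are plain numbers per task and agent, the table itself can be inspected to see why a route was preferred, which is one way to read the abstract's claim of traceable routing evidence.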

Keywords: Multi-Agent Systems, LLM Routing, Ant Colony Optimization, Supervised Fine-tuning, Small Language Model, Interpretable Routing, Quality-Cost Trade-off, Pheromone Specialists
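
Illustrative sketch (not from the paper): the abstract describes task-specific pheromone specialists and a quality-gated update. The toy router below shows one way these two ideas could fit together. The class name, the evaporation rate, the gate threshold, and the deposit rule are all assumptions made for illustration. The intent label is passed in as a plain string instead of being inferred by a fine-tuned small language model, and the asynchronous scheduling of updates is not modeled.

```python
import random
from collections import defaultdict


class PheromoneRouter:
    """Toy router with one pheromone table per task intent (hypothetical design)."""

    def __init__(self, paths, evaporation=0.1, quality_gate=0.7, seed=0):
        self.paths = list(paths)
        self.evaporation = evaporation    # assumed decay rate
        self.quality_gate = quality_gate  # assumed minimum quality for an update
        self.rng = random.Random(seed)
        # One "specialist" table per intent; every path starts at 1.0.
        self.pheromone = defaultdict(lambda: {p: 1.0 for p in self.paths})

    def select(self, intent):
        """Sample a path with probability proportional to its pheromone."""
        table = self.pheromone[intent]
        weights = [table[p] for p in self.paths]
        return self.rng.choices(self.paths, weights=weights, k=1)[0]

    def update(self, intent, path, quality):
        """Reinforce the used path only if the observed quality passes the gate."""
        if quality < self.quality_gate:
            return False
        table = self.pheromone[intent]
        for p in self.paths:
            table[p] *= 1.0 - self.evaporation  # evaporate on every path
        table[path] += quality                  # deposit on the path that was used
        return True
```

Because each intent has its own table, feedback for one task type leaves the routing preferences of the others untouched, which is the cross-task isolation the abstract refers to.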

56. ❌ ODRL Policy Comparison Through Normalisation

Authors: Jaime Osvaldo Salas, Paolo Pareti, George Konstantinidis Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

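Scoring rationale: The paper studies the normalisation and comparison of ODRL (Open Digital Rights Language) policies, which falls under formal methods, the Semantic Web, and policy languages. It does not touch on large models, deep learning, AI techniques, or AI applications in science. Every scoring keyword concerns large-model technology, AI methods, or AI applications, whereas this paper focuses on the formal semantics and algorithmic processing of digital-rights policies, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the complexity and heterogeneity of the ODRL policy language by proposing a parametrised normalisation that rewrites policies into minimal components and simplifies constraints, enabling semantics-preserving comparison of policies and better interoperability.

Abstract (Chinese translation)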

ODRL语言已成为表示数字权利政策与法规的标准。然而其复杂性构成了使用障碍，导致许多相关理论与实际工作仅聚焦于ODRL中互不兼容的不同片段。此外，语义等效的政策可通过多种不同方式表达，这使政策比较与处理更为困难。基于近期定义的形式语义，我们通过提出一种参数化规范化方法来解决这些问题：将ODRL政策归一化为最小组件，将包含许可与禁止的表述重构为仅含许可的政策，并将复杂逻辑约束简化为简单约束。我们提供了计算ODRL政策规范形式的算法，以及简化数值与符号约束的方法。我们证明这些算法能保持政策语义的完整性，并分析了结果的空间复杂度——其随属性数量呈指数增长，随属性唯一值数量呈线性增长。我们展示了该方法如何使复杂政策能在ODRL更基础的片段中表示，以及如何将政策比较问题简化为判断两条规则是否等同的简单问题。

Abstract

The ODRL language has become the standard for representing policies and regulations for digital rights. However its complexity is a barrier to its usage, which has caused many related theoretical and practical works to focus on different, and not interoperable, fragments of ODRL. Moreover, semantically equivalent policies can be expressed in numerous different ways, which makes comparing them and processing them harder. Building on top of a recently defined semantics, we tackle these problems by proposing an approach that involves a parametrised normalisation of ODRL policies into its minimal components which reformulates policies with permissions and prohibitions into policies with permissions exclusively, and simplifies complex logic constraints into simple ones. We provide algorithms to compute a normal form for ODRL policies and simplifying numerical and symbolic constraints. We prove that these algorithms preserve the semantics of policies, and analyse the size complexity of the result, which is exponential on the number of attributes and linear on the number of unique values for these attributes. We show how this makes complex policies representable in more basic fragments of ODRL, and how it reduces the problem of policy comparison to the simpler problem of checking if two rules are identical.

Keywords: ODRL, policy normalization, semantic equivalence, algorithm, digital rights, constraint simplification, policy comparison

57. ❌ Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

Authors: Kadir-Kaan Özer, René Ebeling, Markus Enzweiler Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

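Scoring rationale: The paper, "Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection", focuses on time series anomaly detection. It proposes AxonAD, an unsupervised detector that uses the predictability of multi-head attention query vectors to detect changes in cross-channel dependencies. Although it uses attention (a core component of the Transformer architecture), its subject has no direct connection to the topics in the keyword list, such as LLMs, innovations in deep learning techniques, or AI for Science. The keywords all target large language models, training optimization, inference acceleration, alignment, agent systems, and similar areas, while this paper is confined to anomaly detection in time series analysis and involves no large-model techniques or scientific AI applications, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes AxonAD, an unsupervised time series anomaly detector based on the predictability of multi-head attention query vectors. By combining reconstruction error with a query mismatch score, it outperforms baselines on vehicle telemetry data and the TSB-AD benchmark.

Abstract (Chinese translation)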

多元时间序列异常通常表现为跨通道依赖关系的偏移，而非简单的幅度偏差。例如在自动驾驶中，转向指令可能在内部保持一致性，却与产生的横向加速度解耦。当灵活的序列模型在协调关系改变后仍能合理重建信号时，基于残差的检测器可能遗漏此类异常。本文提出无监督检测器AxonAD，其将多头注意力查询向量的演化过程视为短时域可预测过程。该方法将梯度更新的重建路径与纯历史预测器相结合，后者通过过往上下文预测未来查询向量。训练过程采用掩码预测目标函数，以指数移动平均目标编码器为监督对象。在推理阶段，重建误差与尾部聚合查询失配分数相结合——该分数通过计算近期时间步上预测查询与目标查询间的余弦偏差来衡量。这种双重策略既能敏感捕捉结构依赖关系的变化，又保留了幅度层面的检测能力。在带有区间标注的专有车载遥测数据以及TSB-AD多元基准测试集（17个数据集，180个序列）上，采用无阈值和范围感知指标进行评估，AxonAD在排序质量和时序定位能力上均优于现有强基线模型。消融实验证实查询预测机制与复合评分策略是性能提升的主要驱动力。代码发布于https://github.com/iis-esslingen/AxonAD。

Abstract

Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual-based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi-head attention query evolution as a short horizon predictable process. A gradient-updated reconstruction pathway is coupled with a history-only predictor that forecasts future query vectors from past context. This is trained via a masked predictor-target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail-aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude-level detection. On proprietary in-vehicle telemetry with interval annotations and on the TSB-AD multi-variate suite (17 datasets, 180 series) with threshold-free and range-aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at the URL https://github.com/iis-esslingen/AxonAD.

Keywords: time series anomaly detection, multivariate time series, attention mechanism, query prediction, unsupervised learning, autonomous driving, cross-channel dependencies, AxonAD
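
Illustrative sketch (not from the paper): the abstract says the inference score combines reconstruction error with a tail-aggregated cosine deviation between predicted and target queries on recent timesteps. The snippet below shows one plausible reading of that description. The function names, the window sizes `recent` and `tail`, the choice of averaging the largest deviations, and the mixing weight `alpha` are assumptions; the abstract does not specify them.

```python
import math


def cosine_deviation(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as fully mismatched
    return 1.0 - dot / (norm_u * norm_v)


def query_mismatch_score(predicted, target, recent=4, tail=2):
    """Average the `tail` largest deviations over the last `recent` timesteps."""
    pairs = list(zip(predicted, target))[-recent:]
    deviations = sorted((cosine_deviation(p, t) for p, t in pairs), reverse=True)
    top = deviations[:tail]
    if not top:
        return 0.0  # no timesteps available
    return sum(top) / len(top)


def anomaly_score(reconstruction_error, mismatch, alpha=0.5):
    """Convex combination of the two signals; alpha is a hypothetical weight."""
    return alpha * reconstruction_error + (1.0 - alpha) * mismatch
```

A signal whose amplitude is reconstructed well can still receive a high combined score if its recent queries diverge from the forecast, which matches the motivation given in the abstract.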

58. ❌ Stake the Points: Structure-Faithful Instance Unlearning

Authors: Kiseong Hong, JungKyoo Shin, Eunwoo Kim Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

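Scoring rationale: The paper focuses on machine unlearning and proposes a structure-preserving unlearning framework that uses semantic anchors (stakes) to maintain the knowledge structure. It mainly concerns mitigating privacy risks in pretrained models and is related to deep learning model optimization, but it is not specifically about LLMs or innovations in deep learning techniques. It uses semantic encoders such as CLIP but does not go into large-model techniques, reasoning methods, alignment, compression, scientific AI applications, or the other specific keywords, so most keywords score 0. Only "Pre-training OR Continual Pre-training OR Domain Adaptation" has a relevance of 5, because machine unlearning is typically applied to pretrained models and involves model adaptation and knowledge retention; the connection exists but is not central.

!!! tip deepseek-chat TL;DR

The paper proposes a structure-preserving machine unlearning framework that introduces semantic anchors to maintain the knowledge structure. It reports average gains of 32.9%, 22.5%, and 19.3% on image classification, retrieval, and face recognition respectively, balancing the deletion-retention trade-off and improving generalization.

Abstract (Chinese translation)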

机器遗忘旨在解决预训练模型中的隐私风险，其核心目标是在移除指定数据影响的同时，保持剩余知识的可用性。实现这一目标需要维持剩余实例间的语义关系，而现有研究常忽视这一点。我们观察到，若缺乏这种保持，模型将遭受渐进式结构崩溃，从而破坏删除与保留的平衡。本文提出一种新颖的结构忠实框架，引入“锚点”作为语义参照点以维持知识结构。通过利用这些锚点，我们的框架能够捕捉并稳定知识的语义组织。具体而言，我们从语义编码器（如CLIP）编码的语言驱动属性描述中实例化锚点，并通过结构感知对齐与正则化实现知识结构的保持：前者围绕锚点对齐遗忘前后剩余知识的组织方式，后者则对结构关键参数的更新进行调控。在图像分类、检索和人脸识别任务上的实验结果表明，该方法在性能上平均提升了32.9%、22.5%和19.3%，有效平衡了删除与保留的权衡，并增强了泛化能力。

Abstract

Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining both the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

Keywords: Machine Unlearning, Structure-Faithful Framework, Semantic Anchors, Knowledge Structure Preservation, CLIP Encoder, Deletion-Retention Balance, Image Classification, Generalization Enhancement

59. ❌ FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

Authors: Xin Xu, Weilong Li, Wei Liu, Wenke Huang, Zhixi Yu, Bin Yang, Xiaoying Liao, Kui Jiang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

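Scoring rationale: The paper studies person re-identification under federated learning (FedDG-ReID), a computer vision topic rather than a core innovation in large language models (LLMs) or deep learning techniques. It is unrelated to most keywords (LLMs, MoE, RLHF, RAG, and so on) and only indirectly related to three: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (relevance 5): the paper involves cross-domain generalization but does not explicitly use pre-training or continual pre-training. (2) "Post-training OR Supervised Fine-tuning OR SFT" (relevance 5): the paper describes a fine-tuning strategy (PFTS), but it targets visual prompts rather than typical SFT. (3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (relevance 8): the core contribution is parameter-efficient prompt fine-tuning (PFTS), which freezes the backbone and updates only lightweight prompts, closely matching the PEFT concept. Other keywords such as AI for Science do not match, because the paper targets a vision application rather than a scientific field such as bioinformatics.

!!! tip deepseek-chat TL;DR

To address cross-domain generalization in federated person re-identification, the paper proposes FedBPrompt, a method based on body-distribution-aware visual prompts. Lightweight prompt fine-tuning notably improves feature discrimination and generalization while reducing communication cost.

Abstract (Chinese translation)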

联邦域泛化行人重识别（FedDG-ReID）旨在从分散数据中学习域不变特征表示。尽管视觉Transformer（ViT）已被广泛采用，但其全局注意力机制往往难以从高相似度背景或多变视角中有效区分行人——这一挑战在FedDG-ReID的跨客户端分布偏移环境下更为突出。为解决此问题，我们提出联邦人体分布感知视觉提示（FedBPrompt），通过引入可学习的视觉提示来引导Transformer注意力聚焦于以行人为中心的区域。FedBPrompt采用人体分布感知视觉提示机制（BAPM），该机制包含：整体全身提示（用于抑制跨客户端背景噪声）和身体部位对齐提示（用于捕捉对姿态与视角变化具有鲁棒性的细粒度细节）。为降低高昂的通信成本，我们设计了基于提示的微调策略（PFTS），该策略冻结ViT主干网络，仅更新轻量级提示参数，在保持适应性的同时显著减少通信开销。大量实验表明，BAPM能有效提升特征判别力与跨域泛化能力，而PFTS仅需少量聚合轮次即可实现显著性能提升。此外，BAPM与PFTS均可便捷集成到现有基于ViT的FedDG-ReID框架中，使得FedBPrompt成为联邦行人重识别领域一种灵活高效的解决方案。代码已发布于https://github.com/leavlong/FedBPrompt。

Abstract

Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints – a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at https://github.com/leavlong/FedBPrompt.

Keywords: Federated Learning, Person Re-Identification, Domain Generalization, Visual Prompts, Vision Transformer, Parameter-efficient Fine-tuning, Cross-domain Adaptation
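
Illustrative sketch (not from the paper): the abstract states that PFTS freezes the ViT backbone and updates only lightweight prompts to cut communication. The snippet below illustrates why that helps, assuming an unweighted FedAvg-style average over prompt vectors. The aggregation rule, the dictionary layout with a `"prompt"` key, and the parameter counts in the test are assumptions, not details taken from the paper.

```python
def prompt_only_update(client_states):
    """Average only the prompt parameters across clients (FedAvg-style)."""
    n = len(client_states)
    length = len(client_states[0]["prompt"])
    # The frozen backbone is never read or transmitted here.
    return [sum(s["prompt"][i] for s in client_states) / n for i in range(length)]


def communicated_fraction(num_prompt_params, num_backbone_params):
    """Fraction of all parameters a client uploads when the backbone is frozen."""
    return num_prompt_params / (num_prompt_params + num_backbone_params)
```

With a hypothetical 1,000 prompt parameters and 99,000 backbone parameters, each client would upload 1% of the model per round.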

60. ❌ Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

Authors: Liel Binyamin, Elior Sulem Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12906v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

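Scoring rationale: The paper studies the pre-training of compact language models (BabyBERTa, RoBERTa, LTG-BERT) in bilingual English-French scenarios, comparing child-directed speech with multi-domain corpora and measuring performance on syntactic and semantic tasks across language settings. Its core is a comparison of pre-training approaches for language models, so it has some connection to "Pre-training OR Continual Pre-training OR Domain Adaptation" (relevance 5), but it does not involve innovations in large-model techniques, deep learning applications, or the specific techniques named by other keywords (MoE, RLHF, RAG, and so on). It is conventional language-model research rather than frontier large-model work.

!!! tip deepseek-chat TL;DR

The paper systematically studies the pre-training of compact language models in English-French bilingual scenarios under strictly size-matched data conditions. It finds that child-directed speech helps grammatical judgments in monolingual settings, Wikipedia data benefits semantic tasks, and bilingual pre-training performs notably well on textual entailment.

Abstract (Chinese translation)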

针对发展合理性语言模型的研究主要集中于英语，其在多语言环境下的适用性仍存疑问。本研究通过将BabyBERTa扩展至严格数据规模匹配的英法双语场景，系统性地探究了紧凑型语言模型的表现，涵盖单语、双语及跨语言设定。我们的实验设计对比了两种训练语料类型：（i）遵循BabyBERTa及相关研究的儿童导向语音语料（约250万词元），（ii）将BabyLM框架扩展至法语的多领域语料（约1000万词元）。为实现公平评估，我们同时引入了新的资源，包括法语版本的QAMR与QASRL数据集，以及英法双语多领域语料库。

我们在句法与语义任务上评估了这些模型，并将其与仅使用维基百科数据训练的模型进行对比。结果显示语境依赖效应：维基百科训练持续提升语义任务表现，而儿童导向语音语料则在单语设定中改善语法判断能力。双语预训练在文本蕴含任务上带来显著增益，其中法语提升尤为突出。值得注意的是，BabyBERTa、RoBERTa和LTG-BERT模型均呈现出相似的模式，表明不同架构间存在一致性趋势。

Abstract

Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.

Keywords: compact language models, bilingual pretraining, child-directed speech, English-French scenarios, syntactic and semantic tasks, BabyBERTa, multilingual settings, data conditions

61. ❌ Human-Centered Evaluation of an LLM-Based Process Modeling Copilot: A Mixed-Methods Study with Domain Experts

Authors: Chantale Lauer, Peter Pfeiffer, Nijat Mehdiyev Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12895v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

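Scoring rationale: The paper explicitly studies the application of LLMs to business process management, specifically BPMN (Business Process Model and Notation) modeling. This is LLM research applied to a specific domain, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (relevance 10). It evaluates an LLM-powered BPMN copilot, which treats the LLM as a modeling agent, so it is also highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (relevance 10). Other keywords such as MoE, SFT, and RAG concern specific techniques or application areas that the paper does not cover, so they all have relevance 0.

!!! tip deepseek-chat TL;DR

Using a mixed-methods study, the paper evaluates an LLM-powered BPMN copilot for business process modeling. It finds acceptable usability but low trust, identifies issues with output quality and prompting, and argues that human-centered evaluation is necessary for LLM modeling agents.

Abstract (Chinese translation)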

将大型语言模型（LLM）集成到业务流程管理工具中，有望为非专业人士普及业务流程模型与标注（Business Process Model and Notation, BPMN）建模。尽管自动化框架能够评估语法与语义质量，但它们忽略了信任度、可用性及专业契合度等人为因素。我们采用焦点小组与标准化问卷相结合的方法，对提出的解决方案——一款基于LLM的BPMN协同助手——进行了混合方法评估，参与者包括五位流程建模专家。研究结果揭示了可接受的感知可用性（平均CUQ得分：67.2/100）与显著较低的信任度（平均得分：48.8%）之间的关键矛盾，其中可靠性被评为最受关注的问题（均值=1.8/5）。此外，我们发现了输出质量缺陷、提示设计困难，以及LLM需就流程细节提出更深入澄清问题的需求。我们设想了从领域专家支持到企业质量保障的五种应用场景。本研究论证了以人为中心的评估方法对LLM建模代理进行自动化基准测试补充的必要性。

Abstract

Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.

Keywords: Large Language Models, LLM, Business Process Management, BPMN, Human-Centered Evaluation, Trust, Usability, Modeling Agents

62. ❌ Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Authors: David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

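Scoring rationale: The paper focuses on reinforcement learning post-training for diffusion models and is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (relevance 10), since it explicitly studies RL for the post-training of diffusion models. The other keywords mainly concern large language models, reasoning, alignment, compression, and similar techniques, whereas this paper studies text-to-image diffusion models and has no direct connection to them, so they have relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes an online reinforcement learning method for post-training text-to-image diffusion models. It reduces the variance of model updates by sampling paired trajectories, and it treats the entire sampling process as a single action. Experiments show that it converges faster than existing methods and yields higher image quality and better prompt alignment.

Abstract (Chinese translation)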

强化学习（RL）已成为基于扩散的图像合成模型后训练的标准技术，因其能够通过奖励信号学习，从而显式提升图像质量与提示对齐等理想特性。本文提出一种在线RL变体，通过采样成对轨迹并将流速度向更优图像方向调整，以降低模型更新的方差。与现有方法将每个采样步骤视为独立策略动作不同，我们将整个采样过程视为单一动作。实验采用高质量视觉语言模型和即用型质量指标作为奖励，并通过广泛指标集评估输出结果。相较于既有方法，本方法收敛更快，并在输出质量与提示对齐方面表现更优。

Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

Keywords: Reinforcement Learning, Post-training, Diffusion Models, Text-to-Image Generation, Online RL, Flow Optimization, Image Quality, Prompt Alignment
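
Illustrative sketch (not the paper's algorithm): the abstract describes sampling paired trajectories and pulling the update toward the more favorable result. A generic antithetic finite-difference step on a toy parameter vector conveys the same intuition. The function name, the step sizes `sigma` and `lr`, and the quadratic reward in the test are assumptions; no diffusion or flow model is involved here.

```python
def paired_finite_difference_step(theta, reward, noise, sigma=0.1, lr=0.5):
    """One update from a pair of perturbed samples theta +/- sigma * noise."""
    plus = [t + sigma * n for t, n in zip(theta, noise)]
    minus = [t - sigma * n for t, n in zip(theta, noise)]
    # Central difference of the rewards; the shared noise cancels common offsets.
    diff = (reward(plus) - reward(minus)) / (2.0 * sigma)
    # Move along the noise direction, toward whichever sample scored higher.
    return [t + lr * diff * n for t, n in zip(theta, noise)]
```

Because both samples share the same noise vector with opposite signs, only their reward difference drives the update, which is the usual argument for why paired sampling lowers variance compared with scoring each sample independently.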

63. ❌ Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Authors: Elena Ryumina, Alexandr Axyonov, Dmitry Sysoev, Timur Abdulkadirov, Kirill Almetov, Yulia Morozova, Dmitry Ryumin Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12848v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

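Scoring rationale: The paper focuses on multimodal affect recognition (ambivalence/hesitancy recognition), using VideoMAE, EmotionWav2Vec2.0, a Mamba encoder, and fine-tuned Transformer text models. It is a cross-disciplinary application of computer vision, audio processing, and natural language processing. The scoring keywords all revolve around LLM techniques, training methods, inference optimization, alignment, agent systems, and related core topics; this paper involves no large-model techniques and is not applied to a scientific field such as bioinformatics, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The study proposes a multimodal fusion approach for recognizing ambivalence/hesitancy in video. By integrating scene, face, audio, and text information, it reaches an average MF1 of 83.25% on the BAH corpus, well above the best unimodal configuration (70.02%); the highest final test score is 71.43%.

Abstract (Chinese translation)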

在无约束视频中识别矛盾/犹豫状态是一项具有挑战性的任务，因为这种行为状态具有微妙性、多模态性和情境依赖性。本文针对第十届ABAW竞赛提出了一种视频级矛盾/犹豫识别的多模态方法。该方法整合了四种互补模态：场景、面部、音频和文本。其中，场景动态通过基于VideoMAE的模型捕捉；面部信息通过统计池化聚合的情感帧级嵌入进行编码；声学表征使用EmotionWav2Vec2.0提取，并经由基于Mamba的时序编码器处理；语言线索则通过微调的基于Transformer的文本模型建模。所得的单模态嵌入通过多模态融合模型（包括原型增强变体）进一步整合。在BAH语料库上的实验表明，多模态融合相较于所有单模态基线均取得显著提升。最佳单模态配置的平均MF1为70.02%，而最佳多模态融合模型达到83.25%。最终测试的最高性能（71.43%）由五个原型增强融合模型的集成实现。这些结果凸显了互补的多模态线索与鲁棒的融合策略对于矛盾/犹豫识别的重要性。

摘要 (Abstract)

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.

Keywords: multimodal fusion, ambivalence recognition, hesitancy recognition, video analysis, emotion recognition, BAH corpus, VideoMAE, Mamba encoder
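
The abstract mentions that frame-level face embeddings are aggregated by statistical pooling but does not say which statistics are used. The sketch below assumes the common choice of concatenating the per-dimension mean and standard deviation; the function name `statistical_pooling` and the toy embeddings are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def statistical_pooling(frames):
    """Collapse per-frame embeddings (a list of equal-length lists) into one
    video-level vector: per-dimension means followed by per-dimension stds."""
    if not frames:
        raise ValueError("need at least one frame embedding")
    dims = list(zip(*frames))            # transpose: one tuple per dimension
    means = [mean(d) for d in dims]
    stds = [pstdev(d) for d in dims]     # population std; an assumption
    return means + stds

# Three made-up 2-dimensional frame embeddings.
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
video_vec = statistical_pooling(frames)
print(len(video_vec))  # twice the embedding dimension
```

The pooled vector has a fixed length regardless of how many frames the video contains, which is what allows it to be fed to a fusion model alongside the other modalities.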

64. ❌ Hierarchical Reference Sets for Robust Unsupervised Detection of Scattered and Clustered Outliers

Authors: Yiqun Zhang, Zexi Tan, Xiaopeng Luo, Yunlin Liu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

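Scoring rationale: The paper focuses on anomaly detection in Internet of Things (IoT) data and proposes a graph-based unsupervised outlier detection method for identifying scattered and clustered outliers. It draws on graph theory, anomaly detection algorithms, and cluster analysis, all within traditional machine learning. It does not touch on large language models (LLMs), deep learning principles, model training or fine-tuning methods (such as pre-training, instruction tuning, or RLHF), model optimization techniques (such as quantization or attention mechanisms), inference acceleration, explainable AI, agent systems, or AI applications in science. Every scoring keyword relates directly to large models or deep learning, whereas the paper's topic and methodology belong entirely to traditional data mining and machine learning, so the relevance of every keyword is 0.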

!!! tip deepseek-chat TL;DR

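The paper proposes a graph-based unsupervised outlier detection method that uses local and global reference sets to identify scattered and clustered outliers in IoT data. Experiments validate its effectiveness for outlier detection and for downstream clustering tasks.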

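Abstract (Translated)

Most real-world IoT data analysis tasks, such as clustering and anomalous event detection, are unsupervised and highly susceptible to outliers. Besides sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. When multiple devices or nodes produce similar anomalous measurements because of localized interference, emerging security threats, or regional false alarms, they form micro-clusters, that is, clustered outliers. Because of their relatively high local density, such outliers are easily mistaken for normal behavior, which obscures the detection of scattered and contextual anomalies. To address this, we propose a new graph-based outlier detection paradigm that builds a graph from natural neighbor relationships and combines local-scale and global-scale reference sets extracted from it to evaluate anomalies from multiple perspectives. The method identifies scattered outliers effectively without interference from clustered anomalies, while the graph structure helps reflect and isolate clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and hyperparameter sensitivity evaluation, demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/gordonlok/DROD.

Abstract (Original)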

Most real-world IoT data analysis tasks, such as clustering and anomaly event detection, are unsupervised and highly susceptible to the presence of outliers. In addition to sporadic scattered outliers caused by factors such as faulty sensor readings, IoT systems often exhibit clustered outliers. These occur when multiple devices or nodes produce similar anomalous measurements, for instance, owing to localized interference, emerging security threats, or regional false alarms, forming micro-clusters. These clustered outliers can be easily mistaken for normal behavior because of their relatively high local density, thereby obscuring the detection of both scattered and contextual anomalies. To address this, we propose a novel outlier detection paradigm that leverages the natural neighboring relationships using graph structures. This facilitates multi-perspective anomaly evaluation by incorporating reference sets at both local and global scales derived from the graph. Our approach enables the effective recognition of scattered outliers without interference from clustered anomalies, whereas the graph structure simultaneously helps reflect and isolate clustered outlier groups. Extensive experiments, including comparative performance analysis, ablation studies, validation on downstream clustering tasks, and evaluation of hyperparameter sensitivity, demonstrate the efficacy of the proposed method. The source code is available at https://github.com/gordonlok/DROD.

Keywords: outlier detection, unsupervised learning, graph structures, clustered outliers, IoT data analysis, anomaly detection, multi-perspective evaluation, local and global reference sets
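
To illustrate why a neighborhood graph can separate micro-clusters from scattered outliers, the sketch below uses a mutual k-nearest-neighbor graph and its connected components. This is a simpler stand-in, not the paper's natural-neighbor construction or its local/global reference sets; the helper names, the `min_normal_size` cutoff, and the toy points are all hypothetical.

```python
from math import dist

def mutual_knn_graph(points, k):
    """Link i and j only if each is among the other's k nearest neighbors."""
    n = len(points)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        knn.append(set(order[:k]))
    return [{j for j in knn[i] if i in knn[j]} for i in range(n)]

def components(adj):
    """Connected components of the graph, found by depth-first search."""
    seen, comps = set(), []
    for start in range(len(adj)):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(sorted(comp))
    return comps

def label_components(comps, min_normal_size=3):
    """Singletons: scattered; small groups: clustered; the rest: normal."""
    labels = {}
    for comp in comps:
        if len(comp) == 1:
            kind = "scattered"
        elif len(comp) < min_normal_size:
            kind = "clustered"
        else:
            kind = "normal"
        for idx in comp:
            labels[idx] = kind
    return labels

# Six normal readings, a two-point micro-cluster, and one isolated reading.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.0, 0.0), (4.2, 0.0),
          (5.1, 0.0), (20.0, 0.0), (20.3, 0.0), (40.0, 0.0)]
groups = components(mutual_knn_graph(points, k=2))
labels = label_components(groups)
print(groups)
```

In this toy data the two-point micro-cluster is dense locally, so a purely local density score could treat it as normal; the component size supplies the global view that flags it.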

65. ❌ Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching

Authors: Junwon Moon, Hyunjin Choi, Hansol Park, Heeseung Kim, Kyuhong Shim Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12837v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

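Scoring rationale: The paper studies target speaker extraction (TSE), a speech signal processing problem, and focuses on a two-stage method that combines discriminative masking with flow matching. Its content is unrelated to all scoring keywords, which center on large models, deep learning principles, and their applications in science; it involves no large models, language models, alignment, reasoning, agents, compression, or similar techniques. Although the research background notes that "research applications of large models in different domains may be given points at discretion," this paper belongs to traditional speech processing and neither uses nor involves large-model techniques, so the relevance of every keyword is 0.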

!!! tip deepseek-chat TL;DR

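The paper proposes Mask2Flow-TSE, a two-stage target speaker extraction framework that combines discriminative masking with flow matching. It achieves performance comparable to generative methods in a single inference step, with approximately 85M parameters.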

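Abstract (Translated)

Target speaker extraction (TSE) aims to extract the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches generally fall into two categories: discriminative and generative. Discriminative methods use time-frequency masks for fast inference but often over-suppress the target signal, whereas generative methods synthesize high-quality speech at the cost of many iterative steps. We propose Mask2Flow-TSE, a two-stage framework that combines the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage uses flow matching to refine the output into the target speech. Unlike generative methods that synthesize speech from Gaussian noise, our method starts from the masked spectrogram and can achieve high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves performance comparable to existing generative TSE methods with approximately 85M parameters.

Abstract (Original)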

Target speaker extraction (TSE) extracts the target speaker’s voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative steps. We propose Mask2Flow-TSE, a two-stage framework combining the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves comparable performance to existing generative TSE methods with approximately 85M parameters.

Keywords: Target Speaker Extraction, Two-Stage Framework, Masking, Flow Matching, Speech Separation, Generative Methods, Discriminative Methods, Spectrogram Refinement
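
The single-step refinement described in the abstract can be pictured as one explicit Euler step of the flow-matching ODE dx/dt = v(x, t), started from the masked spectrogram rather than from noise. The sketch below is a toy under stated assumptions: `toy_velocity` stands in for a trained velocity network, and all values are made up rather than taken from the paper.

```python
def euler_flow_step(x0, velocity_fn, t=0.0, dt=1.0):
    """One explicit Euler step of dx/dt = v(x, t): x <- x + dt * v(x, t)."""
    v = velocity_fn(x0, t)
    return [xi + dt * vi for xi, vi in zip(x0, v)]

# Hypothetical clean target features. A real system never sees these at
# inference time; the trained network has to predict the velocity instead.
clean = [0.9, 0.1, 0.4]

def toy_velocity(x, t):
    """Stand-in for a trained model. On a straight-line path from x0 to the
    clean target, the ideal velocity at t = 0 is (clean - x0)."""
    return [c - xi for c, xi in zip(clean, x)]

masked = [0.7, 0.0, 0.5]   # coarse first-stage output (made-up values)
refined = euler_flow_step(masked, toy_velocity)
print(refined)
```

Because the starting point is already close to the target, a single large step can land near it, whereas a path that starts from Gaussian noise generally needs many smaller steps to stay accurate.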

66. ❌ Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning

Authors: Fuhai Chen, Pengpeng Huang, Junwen Wu, Hehong Zhang, Shiping Wang, Xiaoguang Ma, Xuri Ge Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12832v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

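Scoring rationale: The paper focuses on UAV scene change captioning and proposes HDC-CL, a Transformer-based method at the intersection of computer vision and natural language processing. It is unrelated to the vast majority of large-model keywords (such as LLM, MoE, Scaling Laws, RLHF, and PEFT): those keywords concern LLM architecture, training, alignment, and inference optimization, whereas this paper studies a vision-language task and neither uses nor involves LLMs. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because UAV scene understanding can be viewed as an application of AI in remote sensing or environmental science; however, the paper does not explicitly emphasize scientific applications, so it receives 5 points (some relevance).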

!!! tip deepseek-chat TL;DR

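The paper introduces a new task, UAV scene change captioning, and designs a Hierarchical Dual-Change Collaborative Learning method. Using a Dynamic Adaptive Layout Transformer and cross-modal orientation consistency calibration, it achieves state-of-the-art performance on the newly built UAV-SCC dataset.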

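Abstract (Translated)

This paper proposes a new task for UAV scene understanding, UAV Scene Change Captioning (UAV-SCC), which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a moving viewpoint. Unlike traditional change captioning, which mainly describes differences between image pairs taken over time from a fixed camera viewpoint, UAV scene change captioning focuses on differences between image pairs that are captured dynamically by a moving camera and arise from both temporal and spatial scene variations. The core challenge is that, because camera rotation shifts the viewpoint, UAV image pairs share only partially overlapping scene content; the model must understand the scene changes induced by viewpoint changes and make effective use of the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. Specifically, we design a new Transformer, the Dynamic Adaptive Layout Transformer (DALT), which adaptively models the diverse spatial layouts of image pairs and learns interrelated features from overlapping and non-overlapping regions within a flexible, unified encoding layer. In addition, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to increase the model's sensitivity to the direction of viewpoint shifts, enabling more accurate change captioning. To support in-depth research on this task, we build a new benchmark dataset, the UAV-SCC dataset, dedicated to UAV scene change captioning. Extensive experiments show that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.

Abstract (Original)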

This paper proposes a novel task for UAV scene understanding - UAV Scene Change Captioning (UAV-SCC) - which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a movable viewpoint. Unlike traditional change captioning that mainly describes differences between image pairs captured from a fixed camera viewpoint over time, UAV scene change captioning focuses on image-pair differences resulting from both temporal and spatial scene variations dynamically captured by a moving camera. The key challenge lies in understanding viewpoint-induced scene changes from UAV image pairs that share only partially overlapping scene content due to viewpoint shifts caused by camera rotation, while effectively exploiting the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. In particular, a novel transformer, i.e., Dynamic Adaptive Layout Transformer (DALT), is designed to adaptively model diverse spatial layouts of the image pair, where the interrelated features derived from the overlapping and non-overlapping regions are learned within the flexible and unified encoding layer. Furthermore, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to enhance the model’s sensitivity to viewpoint shift directions, enabling more accurate change captioning. To facilitate in-depth research on this task, we construct a new benchmark dataset, named UAV-SCC dataset, for UAV scene change captioning. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.

Keywords: UAV scene change captioning, hierarchical dual-change collaborative learning, dynamic adaptive layout transformer, cross-modal orientation consistency calibration, UAV-SCC dataset, aerial imagery, change captioning, viewpoint shift

67. ❌ Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

Authors: Gyutae Oh, Jungwoo Bae, Jitae Shin Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

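Scoring rationale: The paper addresses domain-incremental learning (DIL) within continual learning (CL) and proposes Residual SODAP, a prompt-based method for mitigating catastrophic forgetting. Its core techniques cover prompt selection, representation adaptation, and classifier knowledge preservation, making it a methodological contribution of deep learning in specific application settings such as medical image classification. Relevance to the keywords is as follows: 1. It is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points), because the paper explicitly deals with domain adaptation and representation adaptation in continual learning. 2. It has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because the paper is evaluated on biomedical datasets such as Skin Cancer, which counts as an application of AI in science. 3. The remaining keywords (such as LLMs, MoE, and SFT) are not covered in the paper and therefore score 0. The paper does not mention large language models (LLMs) or related techniques and mainly concerns continual learning for traditional deep learning models.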

!!! tip deepseek-chat TL;DR

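The paper targets domain-incremental learning in continual learning and proposes Residual SODAP, a method based on residual self-organizing domain-adaptive prompting. By combining sparse prompt selection, residual aggregation, and data-free distillation, it achieves state-of-the-art performance on multiple benchmark datasets and effectively mitigates catastrophic forgetting.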

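Abstract (Translated)

Continual learning (CL) suffers from catastrophic forgetting, a problem that is especially pronounced in domain-incremental learning (DIL), where task identifiers are unavailable and past data cannot be stored. Although prompt-based continual learning (PCL) adapts representations with a frozen backbone, we find that optimizing prompts alone often yields limited gains because of suboptimal prompt selection and classifier-level instability under domain shift. We propose Residual SODAP, a framework that jointly performs prompt-based representation adaptation and classifier-level knowledge preservation. Our method combines $α$-entmax sparse prompt selection with residual aggregation, data-free distillation based on pseudo-feature replay, drift detection based on prompt usage, and uncertainty-aware multi-loss balancing. On three DIL benchmarks that require no task identifiers or additional data storage, Residual SODAP achieves state-of-the-art average accuracy/average forgetting: 0.850/0.047 (DR), 0.760/0.031 (Skin Cancer), and 0.995/0.003 (CORe50).

Abstract (Original)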

Continual learning (CL) suffers from catastrophic forgetting, which is exacerbated in domain-incremental learning (DIL) where task identifiers are unavailable and storing past data is infeasible. While prompt-based CL (PCL) adapts representations with a frozen backbone, we observe that prompt-only improvements are often insufficient due to suboptimal prompt selection and classifier-level instability under domain shifts. We propose Residual SODAP, which jointly performs prompt-based representation adaptation and classifier-level knowledge preservation. Our framework combines $α$-entmax sparse prompt selection with residual aggregation, data-free distillation with pseudo-feature replay, prompt-usage–based drift detection, and uncertainty-aware multi-loss balancing. Across three DIL benchmarks without task IDs or extra data storage, Residual SODAP achieves state-of-the-art AvgACC/AvgF of 0.850/0.047 (DR), 0.760/0.031 (Skin Cancer), and 0.995/0.003 (CORe50).

Keywords: Continual Learning, Domain-incremental Learning, Prompt-based Learning, Catastrophic Forgetting, Residual Aggregation, Data-free Distillation, Sparse Prompt Selection, Knowledge Preservation
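
To show what sparse prompt selection means in practice, the sketch below implements sparsemax, which is the α = 2 special case of α-entmax. Unlike softmax, it can assign exactly zero weight to low-scoring prompts. The abstract does not state which α the paper uses, and the scores below are made up, so treat this as an illustration of the family rather than the paper's configuration.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex
    (sparsemax, i.e. alpha-entmax with alpha = 2)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j   # threshold if the support has size j
        else:
            break                      # support size found; later terms fail too
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical similarity scores between a query and four prompts in a pool.
prompt_scores = [2.0, 1.5, 0.1, -1.0]
weights = sparsemax(prompt_scores)
selected = [i for i, w in enumerate(weights) if w > 0.0]
print(weights, selected)
```

Here only the two highest-scoring prompts keep non-zero weight, so any weighted combination of prompts built from `weights` ignores the rest of the pool entirely.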

68. ❌ Context is all you need: Towards autonomous model-based process design using agentic AI in flowsheet simulations

Authors: Pascal Schäfer, Lukas J. Krinke, Martin Wlotzka, Norbert Asprion Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12813v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

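Scoring rationale: The paper's core topic is the application of an LLM-driven multi-agent system to chemical process flowsheet simulation. Highly relevant keywords: LLMs (uses Claude Opus), LLM Agents (builds an agentic AI framework), Tool Use (generates Chemasim code), Multi-agent Systems (a two-agent architecture that decomposes tasks), and AI for Science (application in chemical engineering). Moderately relevant keywords: Chain of Thought/System 2 Thinking (involves reasoning with engineering knowledge) and In-context Learning (uses technical documentation and examples as context). The remaining keywords involve no technical innovation or concrete implementation in the paper.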
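
The sketch below works through the scoring arithmetic for this entry. The pass mark uses the formula from the report header (5.0 + 0.8 × total weight); the per-keyword rule of weight × relevance is an assumption made for illustration, since the report does not state how the score column is derived. The function names are hypothetical.

```python
# Sketch of the scoring arithmetic. The per-keyword rule (weight * relevance)
# is an assumption for illustration; the report does not state it.

def pass_threshold(total_weight):
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) pairs (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)

# Non-zero relevance values from the table above: five keywords at 10.0 and
# three at 5.0. The remaining rows have relevance 0.0 and add nothing.
relevances = [10.0] * 5 + [5.0] * 3

threshold = pass_threshold(27.0)                          # 27 keywords in the header
as_listed = total_score((0.0, r) for r in relevances)     # weights as printed: 0.0
if_weighted = total_score((1.0, r) for r in relevances)   # header weights: 1.0

print(round(threshold, 1), as_listed, if_weighted)
```

With the weight of 0.0 printed in every row, the total stays at 0.0 no matter how high the relevance values are, which matches the score shown for this entry. Under the same assumed rule, the header weight of 1.0 would give 65.0 against a pass mark of 26.6.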

!!! tip deepseek-chat TL;DR

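The study develops an LLM-based multi-agent AI framework that provides assistance in a chemical process flowsheet simulation environment, successfully generating valid process modelling code through task decomposition and tool use.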

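Abstract (Translated)

Agentic AI systems that integrate large language models (LLMs) with reasoning and tool-use capabilities are transforming many domains, software development in particular. In contrast, their application to chemical process flowsheet modelling remains largely unexplored. This work presents an agentic AI framework that provides assistance in an industrial flowsheet simulation environment. To this end, we show how GitHub Copilot (GitHub, Inc., 2026), when using state-of-the-art LLMs such as Claude Opus 4.6 (Anthropic, PBC, 2026), can generate valid syntax for our in-house process modelling tool Chemasim using the technical documentation and a few commented examples as context. Building on this, we develop a multi-agent system that decomposes process development tasks: one agent solves the abstract problem using engineering knowledge, and another implements the solution as Chemasim code. We demonstrate the effectiveness of the framework on typical flowsheet modelling cases, including (i) a reaction/separation process, (ii) pressure-swing distillation, and (iii) heteroazeotropic distillation with entrainer selection. Based on these cases, we discuss the framework's current limitations and outline future research directions to further enhance its capabilities.

Abstract (Original)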

Agentic AI systems integrating large language models (LLMs) with reasoning and tool-use capabilities are transforming various domains - in particular, software development. In contrast, their application in chemical process flowsheet modelling remains largely unexplored. In this work, we present an agentic AI framework that delivers assistance in an industrial flowsheet simulation environment. To this end, we show the capabilities of GitHub Copilot (GitHub, Inc., 2026), when using state-of-the-art LLMs, such as Claude Opus 4.6 (Anthropic, PBC, 2026), to generate valid syntax for our in-house process modelling tool Chemasim using the technical documentation and a few commented examples as context. Based on this, we develop a multi-agent system that decomposes process development tasks with one agent solving the abstract problem using engineering knowledge and another agent implementing the solution as Chemasim code. We demonstrate the effectiveness of our framework for typical flowsheet modelling examples, including (i) a reaction/separation process, (ii) a pressure-swing distillation, and (iii) a heteroazeotropic distillation including entrainer selection. Along these lines, we discuss current limitations of the framework and outline future research directions to further enhance its capabilities.

Keywords: agentic AI, large language models, multi-agent system, tool use, flowsheet simulation, chemical process modeling, context learning, autonomous design

69. ❌ Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12793v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

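Scoring rationale: The paper proposes Cheers, a model that unifies visual comprehension and generation. It uses an LLM as the core backbone (highly relevant) and involves pre-training and fine-tuning (some relevance). Other keywords, such as MoE, SLMs, RAG, and quantization, are not mentioned in the abstract and are unrelated to the paper.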

!!! tip deepseek-chat TL;DR

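By decoupling patch details from semantic representations, Cheers unifies multimodal comprehension and generation in a single model, achieves 4x token compression, and matches or surpasses advanced models on multiple benchmarks.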

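Abstract (Translated)

A recent frontier topic in multimodal modeling is unifying visual comprehension and generation in a single model. However, the two tasks require mismatched decoding regimes and visual representations, which makes joint optimization within a shared feature space challenging. This paper presents Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing the semantics needed for multimodal understanding and improving image generation fidelity through gated detail residuals. Cheers has three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning; (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation; and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on mainstream benchmarks show that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4x token compression, supporting more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on popular benchmarks such as GenEval and MMBench while requiring only 20% of its training cost, which demonstrates efficient and effective (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.

Abstract (Original)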

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.

Keywords: unified multimodal model, visual comprehension, image generation, LLM-based Transformer, vision tokenizer, token compression, diffusion decoding, semantic representations

70. ❌ The RIGID Framework: Research-Integrated, Generative AI-Mediated Instructional Design

Authors: Yerin Kwak, Zachary A. Pardos Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

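Scoring rationale: The paper studies how to integrate learning sciences research into instructional design workflows, with generative AI as a mediating tool. It is unrelated to most technical keywords (such as MoE, quantization, and inference acceleration), because those keywords concern the concrete implementation or optimization of large models, whereas this paper focuses on the application level. It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points), because generative AI is usually built on LLMs, but the paper does not discuss the technical details of the models themselves in depth. It also has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because education can be viewed as an application of AI in science (educational science), although it is not typical bioinformatics or cheminformatics. No other keyword is covered in the paper, so each scores 0.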

!!! tip deepseek-chat TL;DR

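The paper proposes the RIGID framework, which integrates learning sciences research and uses generative AI as a mediating tool to address the difficulty of systematically incorporating research evidence into instructional design, enabling research-integrated instructional design that is both operational and context-sensitive.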

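Abstract (Translated)

Instructional Design (ID) often faces challenges in incorporating research-based knowledge and pedagogical best practices. Although educational researchers and government agencies stress that ID should be grounded in evidence, integrating research findings into everyday design workflows is usually complex, because it requires considering multiple context-specific demands and constraints. To address this long-standing gap, this paper explores how Learning Sciences (LS) research can be systematically integrated across the ID workflow and how recent advances in generative AI can help realize this integration. Although ID and LS both aim to improve learning experiences through design-oriented approaches in authentic contexts, structured integration between the two remains limited, leaving their complementary insights underused. This paper presents RIGID (Research-Integrated, Generative AI-Mediated Instructional Design), a unified framework that integrates LS research throughout the ID workflow, covering the analysis, design, implementation, and evaluation phases, and uses generative AI to mediate this integration at each stage. The RIGID framework offers a systematic path toward research-integrated instructional design that is both operational and context-sensitive, while preserving the central role of human expertise.

Abstract (Original)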

Instructional Design (ID) often faces challenges in incorporating research-based knowledge and pedagogical best practices. Although educational researchers and government agencies emphasize grounding ID in evidence, integrating research findings into everyday design workflows is often complex, as it requires considering multiple context-specific demands and constraints. To address this persistent gap, this paper explores how research in the learning sciences (LS) can be systematically integrated across ID workflows and how recent advances in generative AI can help operationalize this integration. While ID and LS share a commitment to improving learning experiences through design-oriented approaches in authentic contexts, structured integration between the two fields remains limited, leaving their complementary insights underutilized. We present RIGID (Research-Integrated, Generative AI-Mediated Instructional Design), a unified framework that integrates LS research across ID workflows spanning analysis, design, implementation, and evaluation phases, while leveraging generative AI to mediate this integration at each stage. The RIGID framework provides a systematic approach for enabling research-integrated instructional design that is both operational and context-sensitive, while preserving the central role of human expertise.

Keywords: Instructional Design, Generative AI, Learning Sciences, Research Integration, RIGID Framework, Educational Technology, Workflow Mediation, Human Expertise

71. ❌ FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking

Authors: Cheng Ju, Zejing Zhao, Akio Namiki Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

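Scoring rationale: The paper focuses on multi-object tracking (MOT) in computer vision and proposes a lightweight post-association correction framework (FC-Track) to address identity switches caused by occlusion and object overlap. It covers object detection, tracking association, and real-time processing, but none of the keyword topics: large language models (LLMs), model training or fine-tuning methods (pre-training, instruction tuning, RLHF), efficient inference techniques (quantization, attention optimization), reasoning enhancement (chain of thought, RAG), agent systems, or AI for Science. Relevance is therefore 0 for every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes FC-Track, a lightweight post-association correction framework for online multi-object tracking. An Intersection over Area (IoA) filter and a local appearance-similarity comparison reduce the long-term identity switches caused by object overlap. The paper reports results on MOT17 and MOT20 and claims state-of-the-art performance on MOT20.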

Abstract translation (Chinese)

可靠的多目标跟踪（Multi-Object Tracking, MOT）对于在复杂动态环境中运行的机器人系统至关重要。尽管检测与关联技术近期取得了进展，在线MOT方法仍易因频繁遮挡和目标重叠导致的身份切换问题而受到影响，其中错误的关联会随时间传播并降低跟踪可靠性。本文提出一种轻量级的后关联校正框架（FC-Track），专为在线MOT设计，旨在推理过程中显式处理由重叠引起的误匹配。该方法通过基于交并面积比（Intersection over Area, IoA）的过滤策略，在高重叠条件下抑制不可靠的外观特征更新，并通过在重叠轨迹对内部进行外观相似性比较，局部校正检测与轨迹片段之间的误匹配。通过阻止短期误匹配的传播，本框架有效减少了长期身份切换，而无需依赖全局优化或重识别技术。该框架在线运行，无需全局优化或重识别，适用于实时机器人应用。在MOT17测试集上，我们实现了81.73 MOTA、82.81 IDF1和66.95 HOTA，运行速度为5.7 FPS；在MOT20测试集上实现了77.52 MOTA、80.90 IDF1和65.67 HOTA，运行速度为0.6 FPS。具体而言，本框架FC-Track仅产生29.55%的长期身份切换，显著低于现有在线跟踪器。同时，我们的框架在MOT20基准测试中保持了最先进的性能。

Abstract

Reliable multi-object tracking (MOT) is essential for robotic systems operating in complex and dynamic environments. Despite recent advances in detection and association, online MOT methods remain vulnerable to identity switches caused by frequent occlusions and object overlap, where incorrect associations can propagate over time and degrade tracking reliability. We present a lightweight post-association correction framework (FC-Track) for online MOT that explicitly targets overlap-induced mismatches during inference. The proposed method suppresses unreliable appearance updates under high-overlap conditions using an Intersection over Area (IoA)-based filtering strategy, and locally corrects detection-to-tracklet mismatches through appearance similarity comparison within overlapped tracklet pairs. By preventing short-term mismatches from propagating, our framework effectively mitigates long-term identity switches without resorting to global optimization or re-identification. The framework operates online without global optimization or re-identification, making it suitable for real-time robotic applications. We achieve 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set with a running speed of 5.7 FPS, and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on the MOT20 test set with a running speed of 0.6 FPS. Specifically, our framework FC-Track produces only 29.55% long-term identity switches, which is substantially lower than existing online trackers. Meanwhile, our framework maintains state-of-the-art performance on the MOT20 benchmark.

Keywords: multi-object tracking, online tracking, post-association correction, identity switches, overlap-aware, appearance similarity, real-time robotic applications, MOT benchmark
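
The IoA-based filtering described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Box` type, the definition of IoA as intersection area divided by the box's own area, and the `0.5` threshold are all assumptions, since the abstract does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its corner coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0.0 if they do not overlap)."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, w) * max(0.0, h)


def ioa(a: Box, b: Box) -> float:
    """Assumed IoA: the fraction of box `a` that is covered by box `b`."""
    return intersection_area(a, b) / a.area if a.area > 0 else 0.0


def should_update_appearance(box: Box, others: list[Box], threshold: float = 0.5) -> bool:
    """Skip the appearance update when any other box covers too much of `box`.

    The threshold value is illustrative only.
    """
    return all(ioa(box, other) <= threshold for other in others)
```

For example, a 10x10 box that is 80% covered by a neighbour would keep its previous appearance feature under this rule, while one that is 50% covered would still be updated.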

72. ❌ AI Model Modulation with Logits Redistribution

Authors: Zihan Wang, Zhongkui Ma, Xinguo Feng, Zhiyang Mei, Ethan Ma, Derui Wang, Minhui Xue, Guangdong Bai Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

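Scoring rationale: The paper proposes AIM, a new model modulation paradigm that uses logits redistribution to obtain multiple behaviors from a single model. Its text generation experiments use Llama, so it is relevant to 'Large Language Models' (8 points). The other keywords concern specific techniques (MoE, RLHF, RAG, and so on), specific application areas (such as AI for Science), or concepts the paper does not mention, so none is directly relevant (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes AIM, a model modulation paradigm in which a training data-agnostic, retraining-free logits redistribution strategy lets a single model dynamically adjust its output quality and the input features it focuses on. Its practicality and versatility are evaluated on image classification, semantic segmentation, and text generation.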

Abstract translation (Chinese)

大规模模型通常需要调整以满足模型所有者与用户的多样化需求。然而，维护多个专用模型版本效率低下。为此，我们提出AIM，一种新颖的模型调制范式，使单一模型能够呈现多样化行为以满足特定终端需求。AIM支持两种关键调制模式：效用调制与焦点调制。前者使模型所有者能动态控制输出质量以提供不同效用水平，后者使用户能精确控制模型聚焦的输入特征。AIM提出了一种logits重分布策略，该策略以训练数据无关且无需重新训练的方式运作。我们基于经由联合概率分布刻画的logits排序统计特性建立了形式化理论基础，以确保AIM的调控能力。通过在图像分类、语义分割和文本生成任务中，对包括ResNet、SegFormer和Llama在内的主流架构进行评估，验证了AIM在AI模型调制方面的实用性与普适性。

Abstract

Large-scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of the model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet the specific end requirements. AIM enables two key modulation modes: utility and focus modulations. The former provides model owners with dynamic control over output quality to deliver varying utility levels, and the latter offers users precise control to shift model’s focused input features. AIM introduces a logits redistribution strategy that operates in a training data-agnostic and retraining-free manner. We establish a formal foundation to ensure AIM’s regulation capability, based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM’s practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation and text generation, and prevalent architectures including ResNet, SegFormer and Llama.

Keywords: model modulation, logits redistribution, utility modulation, focus modulation, training data-agnostic, retraining-free, Llama, text generation

73. ❌ TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

Authors: Alexander K Taylor, Junyi Zhang, Ethan Ji, Vigyan Sahai, Haikang Deng, Yuanzhou Chen, Yifan Yuan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Amit Sahai, Wei Wang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

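Scoring rationale: The paper studies automated theorem proving (ATP) LLMs in mathematics, and its core contribution is evaluating how well they generalize across definitional frameworks. It is highly relevant to 'Large Language Models' (10 points) because it explicitly studies ATP LLMs; to 'LLM Agents' (10 points) because it builds an agentic pipeline that extracts a compilable, self-contained local environment for each problem; and to 'AI for Science' (10 points) because it applies LLMs to mathematical theorem proving. Other keywords, such as MoE, SFT, and RAG, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies whether automated theorem proving LLMs generalize beyond the standard MathLib framework. On definitionally equivalent problems stated in a non-standard framework, performance drops by roughly 26% on average, which points to limited generalization across definitional frameworks as the bottleneck of current ATP systems.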

Abstract translation (Chinese)

自动定理证明（ATP）的基准测试主要由基于MathLib形式化的问题构成，导致当前ATP的训练与评估严重偏向MathLib的定义框架。然而，前沿数学研究通常具有探索性且高度依赖原型构建，其使用的定制化构造往往偏离标准库。本研究评估了当前ATP系统在应用于新颖定义框架时的鲁棒性，特别考察了其在标准库问题与定制化数学构造之间的性能差距。我们提出了TaoBench——一个源自陶哲轩《分析学I》的本科难度基准测试集，该测试集通过从头构建核心数学概念（不依赖标准Mathlib定义）以及混合使用从头构建与MathLib构造两种方式来实现分析学的形式化。为确保公平评估，我们构建了一个智能代理流程，能够为每个问题自动提取可编译的、自包含的本地环境。为分离定义框架的影响，我们还将每个问题转化为数学上等价的MathLib表述形式，从而生成可直接对比的TaoBench-Mathlib配对命题。实验表明，尽管最先进的ATP模型在MathLib框架内表现良好，但在定义等价的Tao形式化表述上，其性能平均下降约26%。这表明主要瓶颈在于跨定义框架的泛化能力有限，而非任务本身难度。因此，TaoBench揭示了基准测试性能与实际适用性之间的差距，并为开发和测试更贴合研究数学需求的证明器提供了具体基础。

Abstract

Automated theorem proving (ATP) benchmarks largely consist of problems formalized in MathLib, so current ATP training and evaluation are heavily biased toward MathLib’s definitional framework. However, frontier mathematics is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between standard library problems and bespoke mathematical constructions. We introduce TaoBench, an undergraduate-level benchmark derived from Terence Tao’s Analysis I, which formalizes analysis by constructing core mathematical concepts from scratch, without relying on standard Mathlib definitions, as well as by mixing from-scratch and MathLib constructions. For fair evaluation, we build an agentic pipeline that automatically extracts a compilable, self-contained local environment for each problem. To isolate the effect of definitional frameworks, we additionally translate every problem into a mathematically equivalent Mathlib formulation, yielding paired TaoBench-Mathlib statements for direct comparison. While state-of-the-art ATP models perform capably within the MathLib framework, performance drops by an average of roughly 26% on the definitionally equivalent Tao formulation. This indicates that the main bottleneck is limited generalization across definitional frameworks rather than task difficulty. TaoBench thus highlights a gap between benchmark performance and applicability, and provides a concrete foundation for developing and testing provers better aligned with research mathematics.

Keywords: Automated Theorem Proving, Large Language Models, Generalization, Mathematical Frameworks, Benchmark Evaluation, Agentic Pipeline, MathLib, TaoBench
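
As a hedged illustration of what "constructing core mathematical concepts from scratch" can look like in Lean 4, the sketch below defines a small natural-number type and proves two facts about its addition. The names `MyNat`, `add`, `add_zero`, and `add_succ` are hypothetical and are not taken from TaoBench; the block only shows the style of definition that bypasses the standard library.

```lean
-- Hypothetical from-scratch natural numbers (not taken from TaoBench).
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

namespace MyNat

-- Addition by recursion on the second argument.
def add (m n : MyNat) : MyNat :=
  match n with
  | .zero   => m
  | .succ k => .succ (add m k)

-- Both equations hold by unfolding the definition.
theorem add_zero (m : MyNat) : add m .zero = m := rfl
theorem add_succ (m k : MyNat) : add m (.succ k) = .succ (add m k) := rfl

end MyNat
```

A library-style statement of the same fact would instead use the built-in `Nat` and an existing lemma such as `Nat.add_zero`. A prover that has mostly seen library lemma names cannot rely on them for a type like `MyNat`, which is one plausible reading of the gap the abstract reports.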

Authors: Chenyang Zhu, Hongxiang Li, Xiu Li, Long Chen Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

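Scoring rationale: The paper studies knowledge-aware concept customization and proposes the MoKus framework, which uses cross-modal knowledge transfer to bind textual knowledge to visual concepts. Although it involves multimodal AI and knowledge representation, it does not explicitly use large language models (LLMs) and does not target scientific domains such as biomedicine. The keywords all concern LLM techniques, training methods, inference optimization, agent systems, and similar topics, none of which relates directly to visual concept customization or cross-modal knowledge transfer, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper introduces the task of knowledge-aware concept customization and the MoKus framework, which binds textual knowledge to visual concepts through cross-modal knowledge transfer for high-fidelity customized generation. It reports outperforming existing methods on the new KnowCusBench benchmark.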

Abstract translation (Chinese)

概念定制通常将稀有标记与目标概念绑定。遗憾的是，由于预训练数据很少包含这些稀有标记，此类方法常面临性能不稳定的问题。同时，这些稀有标记无法传达目标概念的内在知识。为此，我们提出知识感知概念定制这一新任务，旨在将多样化的文本知识绑定到目标视觉概念上。该任务要求模型识别文本提示中的知识，以实现高保真度的定制化生成。同时，模型需高效地将所有文本知识绑定至目标概念。因此，我们提出了MoKus这一用于知识感知概念定制的新框架。我们的框架基于一个关键观察：跨模态知识迁移——即在生成过程中，修改文本模态内的知识会自然地迁移到视觉模态。受此启发，MoKus包含两个阶段：（1）在视觉概念学习中，我们首先学习锚点表示以存储目标概念的视觉信息；（2）在文本知识更新中，我们将知识查询的答案更新为锚点表示，从而实现高保真度的定制生成。为在这一新任务上进一步全面评估MoKus，我们构建了首个知识感知概念定制基准：KnowCusBench。大量实验表明，MoKus优于现有最先进方法。此外，跨模态知识迁移特性使MoKus能轻松扩展到其他知识感知应用，如虚拟概念创建和概念擦除。我们还展示了该方法在世界知识基准测试中实现提升的能力。

Abstract

Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.

Keywords: Knowledge-aware Concept Customization, Cross-modal Knowledge Transfer, MoKus, Visual Concept Learning, Textual Knowledge Updating, KnowCusBench, High-fidelity Customized Generation, Virtual Concept Creation

75. ❌ ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

Authors: Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, Eduard Hoy Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12740v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	15.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	15.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

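Scoring rationale: The paper's core topic is tool planning for LLM agents, and it proposes ToolTree, a method based on Monte Carlo tree search. It is therefore highly relevant to 'Monte Carlo Tree Search OR MCTS AND LLM', 'LLM Agents OR Autonomous Agents OR Agentic Workflow', and 'Tool Use OR Function Calling OR API Tool Use' (15 points each). It involves multi-step reasoning and planning, which gives it some relevance to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5 points each). It explicitly uses LLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords, such as MoE, SLMs, training methods, RAG, compression, and scientific applications, appear in neither the title nor the abstract and score 0.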
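
The arithmetic behind the score line can be sketched as below. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report's configuration section. The per-row rule `score = weight × relevance` is an assumption: it matches the rows shown (a weight of 0.0 gives a score of 0.0 whatever the relevance) but is not stated in the report. The function names are hypothetical.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as given in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def weighted_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight * relevance over (weight, relevance) rows.

    Assumed scoring rule; the report does not spell it out.
    """
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords, each with configured weight 1.0.
threshold = pass_mark(27.0)

# Three rows as they appear in the table above: weight 0.0, non-zero relevance.
table_rows = [(0.0, 10.0), (0.0, 15.0), (0.0, 5.0)]
total = weighted_score(table_rows)
```

Under this reading, a paper's total stays at 0.0 as long as the per-row weight is 0.0, which is consistent with the "0.0 / 26.6" score line even for rows with high relevance.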

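!!! tip deepseek-chat TL;DR

The paper addresses the lack of foresight in how LLM agents plan multi-step tool-use tasks. It proposes ToolTree, which combines Monte Carlo tree search-inspired planning with bidirectional pruning, and reports an average gain of about 10% across 4 benchmarks while keeping the highest efficiency.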

Abstract translation (Chinese)

大语言模型（LLM）智能体正日益应用于需要跨多个领域与多样化外部工具交互的复杂多步骤任务。然而，当前LLM智能体的工具规划方法通常依赖于贪婪的、反应式的工具选择策略，这些策略缺乏前瞻性，且未能考虑工具间的相互依赖关系。本文提出ToolTree，一种受蒙特卡洛树搜索启发的新型工具规划范式。ToolTree通过双阶段LLM评估与双向剪枝机制探索可能的工具使用轨迹，使智能体能够在扩展的工具使用序列中做出明智、自适应的决策，并在工具执行前后对潜力较低的路径进行剪枝。在4个基准测试上对开放集与封闭集工具规划任务进行的实证评估表明，ToolTree在保持最高效率的同时持续提升了性能，相较于当前最先进的规划范式平均获得了约10%的性能增益。

Abstract

Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search-inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open-set and closed-set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10% compared to the state-of-the-art planning paradigm.

Keywords: LLM agents, tool planning, Monte Carlo tree search, dual-feedback, bidirectional pruning, multi-step tasks, tool selection, efficiency
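
The idea of pruning before and after tool execution inside a tree search can be sketched as follows. This is a generic UCB1-based sketch, not ToolTree itself: the abstract does not give the selection rule, the scoring functions, or the thresholds, so `Node`, `ucb1`, `expand`, `select`, `backup_or_prune`, and all numeric defaults are assumptions. The `pre_score` callback and the `post_score` value stand in for the two LLM evaluation stages.

```python
import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass(eq=False)  # identity-based equality, so removing a node is unambiguous
class Node:
    tool: str
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def expand(parent: Node, candidates: list[str],
           pre_score: Callable[[str], float], pre_threshold: float = 0.3) -> None:
    """Pre-execution pruning: only tools whose pre-score clears the threshold become children."""
    for tool in candidates:
        if pre_score(tool) >= pre_threshold:
            parent.children.append(Node(tool))


def select(parent: Node) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))


def backup_or_prune(parent: Node, child: Node,
                    post_score: float, post_threshold: float = 0.2) -> None:
    """Post-execution pruning: drop a branch whose result scored poorly, else record the reward."""
    parent.visits += 1
    if post_score < post_threshold:
        parent.children.remove(child)
        return
    child.visits += 1
    child.total_reward += post_score
```

A full planner would repeat select, execute, and backup over several levels of the tree; the sketch keeps a single level to show where the two pruning points sit.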

76. ❌ SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks

Authors: Hongyang Shang, Shuai Dong, Yahan Yang, Junyi Yang, Peng Zhou, Arindam Basu Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

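Scoring rationale: The paper focuses on hardware accelerator design for spiking neural networks (SNNs), specifically an SRAM-based compute-in-memory architecture and a linear-decay neuron model. The keywords all concern large language models, their training and inference techniques, and AI for Science, whereas this work is on neuromorphic computing and hardware acceleration, a different technical area. The paper mentions none of the keyword topics, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper targets the latency and energy bottleneck of the neuron state update step in spiking neural networks. It proposes an SRAM-based compute-in-memory architecture with a linear-decay neuron model and reports a 1.1x to 16.7x reduction in SOP energy consumption and 15.9x to 69x higher energy efficiency, with an accuracy drop of around 1%.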

Abstract translation (Chinese)

脉冲神经网络（Spiking Neural Networks, SNNs）作为一种受生物启发的传统深度网络替代方案，具有事件驱动和高效能计算的特点。然而，其吞吐量仍受限于神经元膜电位的串行更新过程。尽管许多硬件加速器和存内计算（Compute-in-Memory, CIM）架构能高效并行化突触操作（W x I），实现矩阵向量乘法的O(1)复杂度，但后续的状态更新步骤仍需O(N)时间来刷新所有神经元膜电位。这种不匹配使得状态更新成为SNN推理中的主要延迟和能耗瓶颈。为应对这一挑战，我们提出一种基于SRAM的存内计算架构，用于搭载线性衰减漏积分发放（Linear Decay Leaky Integrate-and-Fire, LD-LIF）神经元的SNN，实现了算法与硬件的协同优化。在算法层面，我们以线性衰减近似替代传统的指数膜电位衰减，将高代价的乘法运算转换为简单加法，同时精度损失仅约1%。在架构层面，我们引入一种存内并行更新方案，直接在SRAM阵列中执行原位衰减，消除了全局顺序更新的需求。在基准SNN任务上的评估表明，所提方法将单次操作（SOP）能耗降低了1.1倍至16.7倍，同时能效提升15.9倍至69倍，且相对于原始衰减模型的精度损失可忽略不计。本研究表明，在存内计算架构中，除了加速（W x I）计算外，优化状态更新的动态过程对于实现可扩展、低功耗和实时神经形态处理至关重要。

Abstract

Spiking Neural Networks (SNNs) have emerged as a biologically inspired alternative to conventional deep networks, offering event-driven and energy-efficient computation. However, their throughput remains constrained by the serial update of neuron membrane states. While many hardware accelerators and Compute-in-Memory (CIM) architectures efficiently parallelize the synaptic operation (W x I) achieving O(1) complexity for matrix-vector multiplication, the subsequent state update step still requires O(N) time to refresh all neuron membrane potentials. This mismatch makes state update the dominant latency and energy bottleneck in SNN inference. To address this challenge, we propose an SRAM-based CIM for SNN with Linear Decay Leaky Integrate-and-Fire (LD-LIF) Neuron that co-optimizes algorithm and hardware. At the algorithmic level, we replace the conventional exponential membrane decay with a linear decay approximation, converting costly multiplications into simple additions while accuracy drops only around 1%. At the architectural level, we introduce an in-memory parallel update scheme that performs in-place decay directly within the SRAM array, eliminating the need for global sequential updates. Evaluated on benchmark SNN workloads, the proposed method achieves a 1.1 x to 16.7 x reduction of SOP energy consumption, while providing 15.9 x to 69 x more energy efficiency, with negligible accuracy loss relative to original decay models. This work highlights that beyond accelerating the (W x I) computation, optimizing state-update dynamics within CIM architectures is essential for scalable, low-power, and real-time neuromorphic processing.

Keywords: Spiking Neural Networks, Compute-in-Memory, SRAM-based accelerator, Linear Decay Leaky Integrate-and-Fire, Hardware acceleration, Energy-efficient computation, Neuromorphic processing, State update optimization
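
The algorithmic change described in the abstract, replacing a multiplicative (exponential) membrane decay with an additive (linear) one, can be illustrated in software. This is a sketch under assumptions: the exact LD-LIF update, the clamp at zero, the hard reset, and the parameter values `beta`, `delta`, and `threshold` are not given in the abstract and are chosen here only for illustration.

```python
from typing import Callable


def exp_decay_step(v: float, current: float, beta: float = 0.9) -> float:
    """Conventional leaky update: the membrane potential is scaled by beta (one multiplication)."""
    return beta * v + current


def linear_decay_step(v: float, current: float, delta: float = 0.25) -> float:
    """Assumed linear-decay update: subtract a constant, clamp at zero (no multiplication)."""
    return max(v - delta, 0.0) + current


def simulate(step: Callable[[float, float], float],
             inputs: list[float], threshold: float = 1.0) -> list[int]:
    """Run one neuron over an input sequence and return its spike train (1 = spike)."""
    v = 0.0
    spikes = []
    for current in inputs:
        v = step(v, current)
        fired = v >= threshold
        spikes.append(int(fired))
        if fired:
            v = 0.0  # hard reset after a spike (an illustrative choice)
    return spikes
```

Running `simulate` with each step function on the same input shows how closely the two spike trains agree for a given choice of `beta` and `delta`; the hardware benefit claimed in the abstract comes from the subtraction being cheaper than the multiplication and from performing it in place in the SRAM array, which this software sketch does not model.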

77. ❌ On Using Machine Learning to Early Detect Catastrophic Failures in Marine Diesel Engines

Authors: Francesco Maione, Paolo Lino, Giuseppe Giannino, Guido Maione Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

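Scoring rationale: The paper uses machine learning (a Random Forest in particular) and deep learning (for data augmentation) for early detection of catastrophic failures in marine diesel engines, an application of AI to engineering. It is unrelated to the great majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It has some relevance only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', which read broadly covers AI applied to science and engineering. Because the paper is not core 'AI for Science' research in the strict sense (such as bioinformatics or cheminformatics), it receives 5 points.

!!! tip deepseek-chat TL;DR

The paper proposes a method based on a Random Forest and on the derivatives of sensor deviations for early detection of catastrophic failures in marine diesel engines, and validates it on simulated and real-world data.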

Abstract translation (Chinese)

船舶发动机的灾难性故障意味着功能的严重丧失，并对系统造成不可逆的损毁或破坏。这类故障突发且往往难以预测，对航行、船员及乘客构成严重威胁。其突发性使得早期检测成为唯一有效的应对措施。然而，现有研究多集中于对部件渐进退化的建模，对突发异常现象的关注有限。本文提出一种用于灾难性故障早期检测的新方法。该方法基于一台故障发动机的真实数据，通过评估实际传感器读数与发动机变量预期值之间偏差的导数来实现检测。预测由随机森林算法完成，该算法在测试的多种机器学习算法中表现最为适合。传统方法侧重于监测信号的偏差，而本方法则利用偏差的导数来提供异常动态的更早期指示，从而预警系统内部正在发生的快速危险事件。该方法可在测量值达到临界阈值并触发报警（工业界常用方法）之前检测到异常。因此，操作人员可提前获得预警并关闭发动机，从而防止损坏和意外动力丧失。此外，他们也有时间安全调整航线并规避潜在障碍。仿真结果证实了所提方法在预测灾难性故障发生方面的有效性。基于实际数据的验证进一步强化了该方法的鲁棒性和实际适用性。值得注意的是，由于采用了基于深度学习的数据增强流程，训练预测算法所需的数据获取并非难题。

Abstract

Catastrophic failures of marine engines imply severe loss of functionality and destroy or damage the systems irreversibly. Being sudden and often unpredictable events, they pose a severe threat to navigation, crew, and passengers. The abrupt nature makes early detection the only effective countermeasure. However, research has concentrated on modeling the gradual degradation of components, with limited attention to sudden and anomalous phenomena. This work proposes a new method for early detection of catastrophic failures. Based on real data from a failed engine, the approach evaluates the derivatives of the deviation between actual sensor readings and expected values of engine variables. Predictions are obtained by a Random Forest, which is the most suitable Machine Learning algorithm among the tested ones. Traditional methods focus on deviations of monitored signals, whereas the proposed approach employs the derivatives of the deviations to provide earlier indications of abnormal dynamics, and to alert that a rapid and dangerous event is breaking out within the system. The method allows the detection of anomalies before measurements reach critical thresholds and alarms are triggered, which is the common method in industry. Consequently, operators can be warned in advance and shut down the engine, then prevent damage and unexpected power loss. Moreover, they have the time to safely change the ship route and avoid potential obstacles. Simulation results confirm the effectiveness of the proposed approach in anticipating occurrence of catastrophic failures. Validation on real-world data further reinforces the robustness and practical applicability of the method. It is worth noting that data acquisition to train the predictive algorithm is not a problem, since a Deep Learning-based data augmentation procedure is used.

Keywords: catastrophic failure detection, marine diesel engines, random forest, machine learning, derivative analysis, early warning, data augmentation, anomaly detection

Authors: Kaifan Zhang, Lihuo He, Junjie Ke, Yuqi Ji, Lukun Wu, Lizi Wang, Xinbo Gao Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

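Scoring rationale: This paper decodes visual stimuli from EEG/MEG signals, placing it in brain-computer interfaces and AI applications for neuroscience. Its core content (multimodal information fusion, alignment mechanisms, and diffusion models) is unrelated to the vast majority of the keywords, which mainly concern LLM techniques, training methods, and inference optimization. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the work applies AI to neuroscience and so falls under AI for science, but it is neither bioinformatics nor cheminformatics, hence a relevance of 8 (some connection, but not core).

!!! tip deepseek-chat TL;DR

This paper proposes the CognitionCapturerPro framework, which combines multimodal priors with an asymmetric alignment method to markedly improve the fidelity of visual stimuli reconstructed from EEG signals, raising Top-1 and Top-5 retrieval accuracy on the THINGS-EEG dataset by 25.9% and 10.6%, respectively.

Abstract (Chinese translation)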

基于脑电图（EEG）的视觉刺激重建因保真度损失与表征偏移问题而持续面临挑战。我们提出CognitionCapturerPro，一种通过协同训练将EEG与多模态先验（图像、文本、深度及边缘信息）相融合的增强框架。本研究的核心贡献包括：一种用于量化模态特定保真度的不确定性加权相似性评分机制，以及一个用于整合共享表征的融合编码器。通过采用简化的对齐模块与预训练扩散模型，我们的方法在THINGS-EEG数据集上显著超越了原始CognitionCapturer模型，将Top-1与Top-5检索准确率分别提升了25.9%和10.6%。代码已开源：https://github.com/XiaoZhangYES/CognitionCapturerPro。

Abstract

Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.

Keywords: EEG decoding, visual stimulus reconstruction, multimodal fusion, asymmetric alignment, diffusion model, brain-computer interface, fidelity improvement, THINGS-EEG dataset

Authors: Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan, Jihua Zhu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12721v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

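Scoring rationale: CMHANet focuses on 3D point cloud registration, which belongs to geometric deep learning and computer vision. Although the research background notes that applications of large models in scientific fields may receive discretionary credit, the paper's content is entirely unrelated to the scoring keywords, all of which center on large-model principles, training methods, inference optimization, alignment, and applications. The paper involves no language models, large-model techniques, or concrete AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes CMHANet, a cross-modal hybrid attention network that fuses contextual information from 2D images with geometric detail from 3D point clouds to make point cloud registration more robust in complex real-world scenes, and reports better registration accuracy and robustness than existing methods on multiple datasets.

Abstract (Chinese translation)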

鲁棒点云配准是三维计算机视觉与几何深度学习中的基础任务，对于大规模三维重建、增强现实和场景理解等应用至关重要。然而，现有基于学习的方法在复杂现实场景中——如存在数据不完整、传感器噪声和低重叠区域时——性能往往显著下降。为应对这些局限，我们提出CMHANet，一种新颖的跨模态混合注意力网络。该方法融合了来自二维图像的丰富上下文信息与三维点云的几何细节，从而生成全面且鲁棒的特征表示。此外，我们引入了一种基于对比学习的创新优化函数，该函数强化了几何一致性，并显著提升了模型对噪声和部分观测的鲁棒性。我们在3DMatch数据集及更具挑战性的3DLoMatch数据集上评估了CMHANet。此外，在TUM RGB-D SLAM数据集上的零样本评估验证了模型对未见领域的泛化能力。实验结果表明，我们的方法在配准精度和整体鲁棒性上均取得显著提升，优于现有技术。相关代码已发布于 https://github.com/DongXu-Zhang/CMHANet。

Abstract

Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model’s robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model’s generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code at https://github.com/DongXu-Zhang/CMHANet.

Keywords: point cloud registration, cross-modal, hybrid attention network, 3D computer vision, geometric deep learning, contrastive learning, robustness, generalization

80. ❌ IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration

Authors: Dongxu Zhang, Jihua Zhu, Shiqi Li, Wenbiao Yan, Haoran Xu, Peilin Fan, Huimin Lu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12719v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

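Scoring rationale: This paper addresses the computer vision task of 3D point cloud registration and proposes a new framework, IGASA, comprising Hierarchical Cross-Layer Attention (HCLA) and Iterative Geometry-Aware Refinement (IGAR) modules. The content revolves entirely around point cloud processing, geometric feature extraction, multi-scale fusion, and registration accuracy; it does not involve large language models (LLMs), innovations in deep learning principles, or concrete AI for Science applications. All scoring keywords concern large-model techniques, training methods, inference optimization, alignment, AI agents, or scientific AI applications, none of which relates directly to this paper's computer vision focus, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes IGASA, a new point cloud registration framework that integrates Hierarchical Cross-Layer Attention and Iterative Geometry-Aware Refinement modules within a hierarchical pyramid architecture, and significantly surpasses existing methods in registration accuracy on multiple benchmark datasets.

Abstract (Chinese translation)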

点云配准是三维视觉中的基础任务，为自动驾驶、机器人技术和环境建模等应用提供关键支持。尽管其应用广泛，现有方法在面对真实世界中的严重噪声、显著遮挡和大尺度变换等挑战时往往失效。这些局限常导致在复杂环境中的配准精度下降与鲁棒性不足。本文提出IGASA，这是一种基于分层金字塔架构的新型配准框架，专为鲁棒的多尺度特征提取与融合而设计。该框架集成了两个关键组件：分层跨层注意力模块和迭代几何感知优化模块。HCLA模块利用跳跃注意力机制对齐多分辨率特征并增强局部几何一致性；同时，IGAR模块专为精细匹配阶段设计，通过利用粗匹配阶段建立的可靠对应关系进行优化。这种架构内的协同整合使IGASA能够有效适应多样化的点云结构与复杂变换。我们在四个广泛认可的基准数据集（包括3D(Lo)Match、KITTI和nuScenes）上评估了IGASA的性能。大量实验一致表明，IGASA显著超越了现有先进方法，并在配准精度上实现了显著提升。本研究为推进点云配准技术提供了坚实基础，同时为实际三维视觉应用提供了宝贵见解。IGASA的代码公开于 https://github.com/DongXu-Zhang/IGASA。

Abstract

Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at https://github.com/DongXu-Zhang/IGASA.

Keywords: point cloud registration, hierarchical pyramid architecture, skip attention, geometry-aware refinement, 3D vision, feature extraction, robust matching, autonomous driving

81. ❌ Altered Thoughts, Altered Actions: Probing Chain-of-Thought Vulnerabilities in VLA Robotic Manipulation

Authors: Tuan Duong Trinh, Naveed Akhtar, Basim Azam Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

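Scoring rationale: The paper studies the vulnerability of chain-of-thought (CoT) reasoning in Vision-Language-Action (VLA) models. It is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10), since that is the core object of study; relevant to 'Large Language Models OR LLMs OR Foundation Models' (8), since the paper involves both the use of LLMs in VLA models and LLM-based attacks; somewhat related to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5), since CoT reasoning is a form of in-depth reasoning; and somewhat related to 'Mechanistic Interpretability OR Explainable AI' (5), since the paper probes the model's internal reasoning mechanism. The remaining keywords are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

This paper studies the vulnerability of chain-of-thought reasoning in Vision-Language-Action models and finds that targeted corruption of object names in the reasoning trace markedly lowers robot task success, whereas attacks that keep the trace plausible end up preserving the key information the decoder needs.

Abstract (Chinese translation)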

近期视觉-语言-动作（VLA）模型日益采用思维链（CoT）推理方式，在解码运动指令前会先生成自然语言计划。这一位于推理模块与动作解码器之间的内部文本通道尚未受到对抗性审视。我们提出：动作解码器实际依赖的是这一中间计划的哪些特性？若仅针对推理轨迹进行定向破坏（保持所有输入完整），是否会降低机器人的物理任务执行能力？我们设计了一套包含七种文本破坏方法的分类体系，将其归纳为三个攻击层级（盲噪声、机械语义攻击及大语言模型自适应攻击），并在40项LIBERO桌面操作任务中，将其应用于一种先进的推理型VLA模型。研究结果揭示了一种显著的不对称性：在推理轨迹中替换物体名称会使整体成功率降低8.3个百分点（pp）——在目标导向任务中降幅达19.3 pp，在单项任务中甚至达到45 pp；而句子重排序、空间方向反转、词汇噪声，乃至由700亿参数大语言模型生成的看似合理但错误的计划，其影响均可忽略不计（波动在±4 pp内）。这种不对称性表明，动作解码器依赖的是实体指称完整性，而非推理质量或序列结构。值得注意的是，基于大语言模型的复杂攻击者表现反而不如简单的机械式物体名称替换，因为保持计划合理性会无意中保留解码器所需的实体接地结构。通过使用非推理型VLA进行的跨架构对照实验证实，该漏洞为推理增强模型所特有；而指令级攻击对两类架构均会造成性能下降——这证明内部推理轨迹是一种独特且隐蔽的威胁载体，可规避输入验证防御机制的检测。

Abstract

Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone – with all inputs left intact – degrade a robot’s physical task performance? We design a taxonomy of seven text corruptions organized into three attacker tiers (blind noise, mechanical-semantic, and LLM-adaptive) and apply them to a state-of-the-art reasoning VLA across 40 LIBERO tabletop manipulation tasks. Our results reveal a striking asymmetry: substituting object names in the reasoning trace reduces overall success rate by 8.3 percentage points (pp) – reaching −19.3 pp on goal-conditioned tasks and −45 pp on individual tasks – whereas sentence reordering, spatial-direction reversal, token noise, and even a 70B-parameter LLM crafting plausible-but-wrong plans all have negligible impact (within ±4 pp). This asymmetry indicates that the action decoder depends on entity-reference integrity rather than reasoning quality or sequential structure. Notably, a sophisticated LLM-based attacker underperforms simple mechanical object-name substitution, because preserving plausibility inadvertently retains the entity-grounding structure the decoder needs. A cross-architecture control using a non-reasoning VLA confirms the vulnerability is exclusive to reasoning-augmented models, while instruction-level attacks degrade both architectures – establishing that the internal reasoning trace is a distinct and stealthy threat vector invisible to input-validation defenses.
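
To make the "mechanical object-name substitution" corruption concrete, here is a minimal sketch of what such a text edit might look like. The plan text, the swap table, and the function name are hypothetical; the paper's actual corruption procedure and prompts are not given in the abstract.

```python
import re


def substitute_object_names(plan, swaps):
    """Replace each listed object name with its distractor (whole words only).

    A single regex pass is used so that a replacement is never itself
    replaced again by a later rule.
    """
    if not swaps:
        return plan
    # Longer names first, so "red bowl" wins over a shorter overlapping key.
    names = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: swaps[match.group(1)], plan)


# Hypothetical reasoning trace emitted before action decoding.
plan = "Locate the red bowl. Grasp the red bowl. Place it on the plate."

# Hypothetical distractor objects drawn from the same scene vocabulary.
swaps = {"red bowl": "cream cheese", "plate": "stove"}

corrupted = substitute_object_names(plan, swaps)
print(corrupted)
```

The sentence order and step structure of the plan are untouched; only the entity references change, which is the property the abstract identifies as the one the action decoder depends on.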

Keywords: Vision-Language-Action models, chain-of-thought reasoning, robotic manipulation, adversarial attacks, reasoning vulnerabilities, entity-reference integrity, LIBERO benchmark, internal reasoning trace

82. ❌ AI Planning Framework for LLM-Based Web Agents

Authors: Orit Shahnovsky, Rotem Dror Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12710v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

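Scoring rationale: The paper's core is a planning framework for LLM-based web agents, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10). It involves task decomposition and planning, giving some connection to 'Chain of Thought' and 'System 2 Thinking' (5). The framework aims to improve diagnosability, which relates to 'Mechanistic Interpretability' (5). Web agents involve tool use, which relates to 'Tool Use' (5). Other keywords, such as MoE, quantization, and AI for science, do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of diagnosing LLM-based web agents that operate as black boxes, this paper proposes a framework that maps modern agent architectures to traditional planning paradigms, introduces new evaluation metrics and a dataset, and demonstrates the trade-off between planning strategies in task success rate and element accuracy.

Abstract (Chinese translation)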

开发面向网络任务的自主智能体是人工智能领域的核心挑战。尽管大语言模型（LLM）智能体能够解析复杂的用户请求，但其通常以黑箱模式运行，导致难以诊断其失败原因或规划过程。本文通过将网络任务形式化地视为序列决策过程来应对这一不足。我们提出了一种分类法，将现代智能体架构与传统规划范式进行映射：逐步执行（Step-by-Step）智能体对应广度优先搜索（BFS），树搜索（Tree Search）智能体对应最佳优先树搜索，而预先全规划（Full-Plan-in-Advance）智能体则对应深度优先搜索（DFS）。该框架为系统性诊断上下文漂移和任务分解不连贯等故障提供了理论依据。为评估这些行为，我们提出了五项新颖的评估指标，这些指标能超越简单的成功率来衡量轨迹质量。我们通过一个包含794条来自WebArena基准测试的人工标注轨迹的新数据集来支撑此项分析。最后，我们通过对比基线逐步执行智能体与一种新颖的预先全规划实现方案，验证了所提出的评估框架。结果显示，虽然逐步执行智能体与人类黄金轨迹的吻合度更高（总体成功率38%），但预先全规划智能体在元素准确率（89%）等技术指标上表现更优，这证明了我们提出的指标对于根据特定应用约束选择合适的智能体架构具有必要性。

Abstract

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.
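
The three classical search orders named in the taxonomy differ only in how the frontier of unexplored states is ordered. The sketch below illustrates that difference on a toy page graph; it shows the textbook algorithms only, not the paper's agents, and the graph, the heuristic scores, and all names are hypothetical.

```python
import heapq
from collections import deque


def search(start, goal, neighbors, strategy, score=None):
    """Return the order in which states are expanded until the goal is reached.

    strategy: "bfs" (FIFO frontier), "dfs" (LIFO frontier),
              or "best" (priority queue keyed by score; lower is better).
    """
    tie = 0  # insertion counter that breaks ties in the priority queue
    frontier = [(score(start), tie, start)] if strategy == "best" else deque([start])
    seen = {start}
    order = []
    while frontier:
        if strategy == "bfs":
            state = frontier.popleft()
        elif strategy == "dfs":
            state = frontier.pop()
        else:
            state = heapq.heappop(frontier)[2]
        order.append(state)
        if state == goal:
            break
        for nxt in neighbors(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            if strategy == "best":
                tie += 1
                heapq.heappush(frontier, (score(nxt), tie, nxt))
            else:
                frontier.append(nxt)
    return order


# Hypothetical web task: reach the checkout page starting from the home page.
graph = {"home": ["search", "cart"], "search": ["item"], "cart": ["checkout"],
         "item": [], "checkout": []}
# Hypothetical heuristic: estimated remaining steps to the goal.
estimate = {"home": 2, "search": 3, "cart": 1, "item": 4, "checkout": 0}

bfs_order = search("home", "checkout", graph.__getitem__, "bfs")
dfs_order = search("home", "checkout", graph.__getitem__, "dfs")
best_order = search("home", "checkout", graph.__getitem__, "best", score=estimate.get)
print(bfs_order, dfs_order, best_order, sep="\n")
```

On this toy graph the FIFO frontier visits every page before reaching the goal, while the LIFO and priority-ordered frontiers commit to one branch early; which behavior is preferable for a web agent is exactly the trade-off the paper's metrics are meant to expose.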

Keywords: LLM-based web agents, autonomous agents, planning framework, sequential decision-making, evaluation metrics, WebArena benchmark, task decomposition, agent architectures

83. ❌ Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Authors: Donglin Yu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

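Scoring rationale: The paper's core is optimizing multimodal large language model (MLLM) inference; it directly involves LLMs and KV caching (KV Cache Compression) and focuses on inference acceleration (Inference Acceleration). It proposes partitioning at the modality boundary, which substantially reduces cross-device transfer and enables efficient inference on heterogeneous GPU clusters, an innovation in large-model technical principles. Other keywords, such as MoE, SLMs, training methods, alignment, RAG, and agents, are neither mentioned in the abstract nor relevant to it.

!!! tip deepseek-chat TL;DR

This paper examines the inefficiency in multimodal LLM inference caused by the differing hardware demands of the vision-encoding and language-generation phases, and proposes partitioning at the modality boundary, which lowers cost and raises throughput on heterogeneous GPU clusters.

Abstract (Chinese translation)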

多模态大语言模型（MLLM）的推理过程可分为两个硬件需求相反的阶段：视觉编码阶段受计算能力限制，而语言生成阶段受内存带宽限制。研究表明，在标准的Transformer KV缓存机制下，模态边界（即视觉编码器与语言模型之间的划分点）在所有保持标准阶段式执行的划分方案中，能最小化跨设备数据传输。在此处划分可将传输复杂度从$O(L \cdot s_{\mathrm{ctx}})$字节（阶段级解耦下GB级的KV缓存）降低至$O(N_v \cdot d)$字节（MB级的嵌入向量），实现$O(L)$倍的降低，其中$L$为Transformer模型的深度。这一结论适用于不同的注意力机制（MHA/GQA）、动态视觉分辨率及模型规模，且优势随模型深度增加而扩大。直接推论是：现有的阶段级解耦系统受限于高带宽互连技术（如NVLink），而模态级解耦使得在通用PCIe上实现跨层级异构服务成为可能。一个闭式成本模型表明，在阶段可分离的工作负载下，异构部署具有成本最优性（理论预测节省31.4%；实测节省40.6%）。我们构建了HeteroServe——一个具备模态级划分与跨层级调度能力的阶段感知运行时系统，并在LLaVA-1.5-7B和Qwen2.5-VL模型上以vLLM v0.3.0为基准进行评估。在相同的4xA100硬件上，引擎优化使吞吐量提升最高达54%。在固定预算下，异构集群（3.8万美元）相比同构基线（6.4万美元）在保持延迟不劣化的前提下，实现了37%的“每美元生成令牌数”（Tokens/$）提升。

Abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\mathrm{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4xA100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster ($38k) improves Tokens/$ by 37% over a homogeneous baseline ($64k) without degrading latency.
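
A back-of-the-envelope calculation shows why the two transfer sizes land at GB scale and MB scale. The model dimensions below are assumed values for a 7B-class model with multi-head attention in fp16; they are illustrative and not taken from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers, context_tokens, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: keys and values (factor 2) per layer per token."""
    return 2 * num_layers * context_tokens * num_kv_heads * head_dim * bytes_per_value


def embedding_bytes(num_visual_tokens, hidden_dim, bytes_per_value=2):
    """Size of the vision encoder output handed to the language model."""
    return num_visual_tokens * hidden_dim * bytes_per_value


# Assumed configuration (not from the paper).
NUM_LAYERS = 32          # L, transformer depth
NUM_KV_HEADS = 32        # equals the query head count under MHA
HEAD_DIM = 128
HIDDEN_DIM = NUM_KV_HEADS * HEAD_DIM   # d = 4096
CONTEXT_TOKENS = 2048    # s_ctx, tokens held in the cache
VISUAL_TOKENS = 576      # N_v, visual tokens per image

kv = kv_cache_bytes(NUM_LAYERS, CONTEXT_TOKENS, NUM_KV_HEADS, HEAD_DIM)
emb = embedding_bytes(VISUAL_TOKENS, HIDDEN_DIM)

print(f"KV cache transfer:  {kv / 2**30:.2f} GiB")
print(f"Embedding transfer: {emb / 2**20:.2f} MiB")
print(f"Ratio: {kv / emb:.1f}x")
```

Under these assumptions the ratio is $2 L \cdot s_{\mathrm{ctx}} / N_v$, so it grows linearly with depth $L$, consistent with the $O(L)$ reduction stated in the abstract. With grouped-query attention the KV cache shrinks by the head-sharing factor, which lowers the ratio but does not change its dependence on $L$.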

Keywords: Multimodal LLM, MLLM, KV caching, inference acceleration, heterogeneous GPU, cost-efficient, modality partitioning, HeteroServe

84. ❌ Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration

Authors: Zhuyu Teng, Pei Chen, Yichen Cai, Ruoqing Lu, Zhaoqu Jiang, Jiayang Li, Weitao You, Lingyun Sun Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12701v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

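Scoring rationale: The paper mainly studies cognitive alignment in human-AI collaboration, using a shared first-person perspective to bridge the communication and understanding gulfs. It is unrelated to most keywords because it focuses on a collaboration framework rather than large-model technology itself. It is only weakly related to three keywords: (1) 'Alignment' (5): it concerns human-AI cognitive alignment, but not value alignment in the technical sense; (2) 'Self-Correction/Self-Improvement/Self-Reflection' (5): the framework includes a reflective feedback component that lets the AI revise its understanding; (3) 'LLM Agents/Autonomous Agents/Agentic Workflow' (5): it involves the role of an AI agent in collaborative tasks but does not explicitly use LLM agent techniques. No other keyword is addressed.

!!! tip deepseek-chat TL;DR

To address the communication and understanding gulfs in human-AI collaboration, this paper proposes Eye2Eye, a framework based on a shared first-person perspective that achieves cognitive alignment through three components (joint attention coordination, revisable memory, and reflective feedback); experiments show that it significantly reduces task completion time and interaction load while increasing trust.

Abstract (Chinese translation)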

尽管多模态人工智能取得了进展，当前基于视觉的助手在协作任务中仍常显低效。我们识别出两大关键鸿沟：一是沟通鸿沟，由于通道不匹配，用户必须将丰富的并行意图转化为言语指令；二是理解鸿沟，人工智能难以解读细微的具身化线索。为解决这些问题，我们提出Eye2Eye框架，该框架利用第一人称视角作为人机认知对齐的通道。它整合了三个组件：(1) 用于流畅焦点对齐的联合注意协调机制，(2) 维护动态共同基础的可修正记忆系统，以及(3) 允许用户澄清和完善AI理解的反思性反馈机制。我们在一个增强现实原型中实现了该框架，并通过用户研究和事后流程评估进行验证。结果表明，Eye2Eye显著降低了任务完成时间和交互负荷，同时提升了信任度，证明其组件协同工作以改善协作效能。

Abstract

Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch, and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback allowing users to clarify and refine AI’s understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating its components work in concert to improve collaboration.

Keywords: Human-AI collaboration, Cognitive alignment, First-person perspective, Joint attention, Revisable memory, Reflective feedback, AR prototype, Task efficiency

85. ❌ HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

Authors: Andrey V. Savchenko, Kseniia Tsypliakova Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

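Scoring rationale: The paper focuses on facial emotion recognition, a computer vision task, using pre-trained EfficientNet models and a traditional machine learning method (a multi-layer perceptron). It does not involve large language models, innovations in deep learning principles, or AI for Science. The keywords all concern large-model technology, deep learning innovations, or scientific AI applications, none of which the paper addresses, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a facial emotion recognition method based on pre-trained EfficientNet models and sliding-window smoothing, and reports significantly improved validation metrics on the four tasks of the ABAW-10 competition.

Abstract (Chinese translation)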

本文介绍了我们在第十届野外情感行为分析（ABAW）竞赛中的研究成果。针对逐帧面部情感理解任务（逐帧面部表情识别、效价-唤醒度估计、动作单元检测），我们提出了一种基于预训练EfficientNet情感识别模型进行面部嵌入提取的快速方法。若该模型置信度超过阈值，则直接采用其预测结果；否则，我们将嵌入向量输入到在AffWild2数据集上训练的简单多层感知机中进行处理。通过固定大小的滑动窗口对估计的类别分数进行平滑处理，以降低逐帧预测中的噪声干扰。对于细粒度暴力检测任务，我们研究了多种预训练架构用于提取帧级嵌入特征及其在视频分类中的聚合方法。在ABAW挑战赛四项任务上的实验结果表明，我们的方法在验证指标上较现有基线模型取得了显著提升。

Abstract

This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model’s confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
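The gating and smoothing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "confidence" means the top class probability of the pre-trained model, that both models emit per-frame class scores, and that smoothing is a centered moving average. The function names, the 0.7 threshold, and the window size are hypothetical.

```python
def gated_scores(backbone_probs, mlp_probs, threshold=0.7):
    """Per frame, keep the backbone scores when their top probability
    exceeds `threshold`; otherwise fall back to the MLP scores.
    (Assumption: confidence == max class probability.)"""
    fused = []
    for backbone, mlp in zip(backbone_probs, mlp_probs):
        fused.append(list(backbone) if max(backbone) > threshold else list(mlp))
    return fused


def smooth_scores(scores, window=3):
    """Average class scores over a centered sliding window of fixed size.
    The window is truncated at the start and end of the sequence."""
    half = window // 2
    n = len(scores)
    smoothed = []
    for t in range(n):
        frames = scores[max(0, t - half):min(n, t + half + 1)]
        smoothed.append([sum(col) / len(frames) for col in zip(*frames)])
    return smoothed


def predict_labels(scores):
    """Index of the highest score in each frame."""
    return [max(range(len(s)), key=s.__getitem__) for s in scores]


# Toy three-frame clip with two classes (values are made up).
backbone = [[0.9, 0.1], [0.55, 0.45], [0.8, 0.2]]
mlp = [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]
fused = gated_scores(backbone, mlp, threshold=0.7)
labels = predict_labels(smooth_scores(fused, window=3))
```

In this toy clip the middle frame falls below the threshold, falls back to the MLP, and flips its label; smoothing over three frames restores a consistent label sequence. The sketch covers only the classification case, not valence-arousal regression.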

Keywords: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection, EfficientNet, AffWild2 dataset, Video Classification, Emotion Recognition, ABAW competition

86. ❌ Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers

Authors: Yue Zhang, Chuanlong Qiu, Xinfa Liao, Yiqun Zhang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12684v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

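Scoring rationale: The paper focuses on federated clustering and proposes a framework named Fed-$k^*$-HC that automatically determines the optimal number of clusters in a federated learning setting. Its core topics are federated learning, clustering algorithms, data privacy, and distributed computing. It does not involve large language models (LLMs), deep learning principles, AI for Science applications, or any specific technique among the scoring keywords (such as MoE, RLHF, or RAG). The keywords all concern large models, deep learning techniques, or scientific AI applications, whereas this paper applies traditional unsupervised machine learning (clustering) within a federated learning framework, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a federated clustering framework named Fed-$k^*$-HC that addresses the challenge of automatically determining the optimal number of clusters in privacy-protected distributed data, and achieves accurate cluster exploration on diverse datasets through hierarchical clustering and density-based merging.

Abstract (Chinese translation)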

联邦聚类（Federated Clustering，FC）是一种新兴且前景广阔的技术，旨在以无监督方式从分布式且受隐私保护的数据中探索数据分布模式。现有的联邦聚类方法通常隐含一个假设：客户端数据具有已知数量的、大小均匀的聚类。然而，在实际场景中，真实的聚类数量通常是未知的，且聚类规模天然存在不平衡性。此外，联邦学习中保护隐私的传输约束不可避免地减少了可用信息，这使得开发鲁棒且准确的联邦聚类方法极具挑战性。为此，我们提出了一种名为 Fed-$k^*$-HC 的新型联邦聚类框架，该框架能够基于通过层次聚类（hierarchical clustering）探索的数据分布，自动确定最优聚类数量 $k^*$。为获取用于确定 $k^*$ 的全局数据分布，我们让每个客户端生成微子聚类（micro-subclusters），并将其原型（prototypes）上传至服务器进行层次化合并。基于密度的合并设计使得该方法能够探索不同大小和形状的聚类，而渐进式合并过程可根据原型之间的邻近关系自行终止，从而确定 $k^*$。在多种数据集上的大量实验表明，所提出的 Fed-$k^*$-HC 框架具备准确探索合适聚类数量的联邦聚类能力。

Abstract

Federated Clustering (FC) is an emerging and promising solution in exploring data distribution patterns from distributed and privacy-protected data in an unsupervised manner. Existing FC methods implicitly rely on the assumption that clients are with a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which can automatically determine an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, we let each client generate micro-subclusters. Their prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows exploring clusters of varying sizes and shapes, and the progressive merging process can self-terminate according to the neighboring relationships among the prototypes to determine $k^*$. Extensive experiments on diverse datasets demonstrate the FC capability of the proposed Fed-$k^*$-HC in accurately exploring a proper number of clusters.
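The client/server flow in the abstract (clients build micro-subclusters, upload only prototypes, and the server merges them until merging stops on its own) can be illustrated with a deliberately simplified stand-in. The abstract does not specify the density-based merging rule, so this sketch substitutes greedy leader clustering on the clients and single-linkage merging with a distance cutoff on the server. The function names and the `radius` and `cutoff` parameters are hypothetical and do not come from the paper.

```python
import math


def micro_prototypes(points, radius):
    """Client side (stand-in): greedy leader clustering. A point joins the
    first micro-subcluster whose running-mean prototype lies within
    `radius`; otherwise it starts a new one. Only prototypes are returned,
    so raw points never leave the client."""
    sums, counts = [], []
    for p in points:
        for i in range(len(sums)):
            proto = [s / counts[i] for s in sums[i]]
            if math.dist(p, proto) <= radius:
                sums[i] = [s + x for s, x in zip(sums[i], p)]
                counts[i] += 1
                break
        else:
            sums.append(list(p))
            counts.append(1)
    return [[s / c for s in ss] for ss, c in zip(sums, counts)]


def merge_prototypes(prototypes, cutoff):
    """Server side (stand-in): single-linkage agglomerative merging that
    terminates once the closest pair of clusters is farther apart than
    `cutoff`. The number of clusters left is the estimated k*."""
    clusters = [[p] for p in prototypes]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break  # self-termination: no sufficiently close pair remains
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Two toy clients; together their data contain three groups.
client_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
client_b = [(0.1, 0.2), (5.2, 5.1), (10.0, 0.0), (10.1, 0.2)]
uploaded = micro_prototypes(client_a, 1.0) + micro_prototypes(client_b, 1.0)
k_star = len(merge_prototypes(uploaded, cutoff=2.0))
```

A fixed cutoff is the simplest possible stopping rule; the paper instead describes termination based on neighboring relationships among prototypes, which would adapt to clusters of different densities.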

Keywords: Federated Clustering, Hierarchical Clustering, Optimal Cluster Numbers, Privacy-preserving, Distributed Data, Unsupervised Learning, Data Distribution, Density-based Merging

87. ❌ Experimental evidence of progressive ChatGPT models self-convergence

Authors: Konstantinos F. Xylogiannopoulos, Petros Xanthopoulos, Panagiotis Karampelas, Georgios A. Bakamitsos Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12683v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

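Scoring rationale: The paper's core subject is model self-convergence in LLMs, a phenomenon linked to recursive training on synthetic data, which is highly relevant to 'Large Language Models' (10). The study concerns training on synthetic data, which is somewhat related to 'Scaling Laws AND Data Quality' and 'Pre-training' (5 each). The paper discusses declining output diversity, which relates to 'Self-Correction OR Self-Improvement OR Self-Reflection' (8). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study finds that successive ChatGPT versions produce progressively less diverse text, a phenomenon the authors call model self-convergence and tentatively attribute to the growing share of synthetic data in the training sets.

Abstract (Chinese translation)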

在合成生成数据上经历递归训练的大语言模型（Large Language Models, LLMs）容易遭受模型崩溃的影响，这是一种以生成无意义输出为标志的现象。现有研究多从理论或实证角度探讨此问题，且通常聚焦于单一模型在其自身输出上的递归训练。尽管先前研究已警示在此类条件下大语言模型输出质量可能退化，但尚未有纵向研究来评估这种随时间推移的效应。在本研究中，我们采用文本相似性度量来评估不同ChatGPT模型生成多样化文本输出的能力。我们的研究结果表明，即使通过将温度参数设置为1来明确提示模型生成多样文本，近期发布的ChatGPT版本在生成多样化文本方面的能力仍出现了可测量的下降。观察到的输出多样性降低可能归因于其训练数据集中合成数据量的影响，而这源于大语言模型生成数据在互联网上的渗透。由于不同ChatGPT版本间生成文本的相似性逐渐增加，该现象被定义为模型自收敛。

Abstract

Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from either theoretical or empirical perspectives, often focusing on a single model trained recursively on its own outputs. While prior studies have cautioned against the potential degradation of LLM output quality under such conditions, no longitudinal investigation has yet been conducted to assess this effect over time. In this study, we employ a text similarity metric to evaluate different ChatGPT models’ capacity to generate diverse textual outputs. Our findings indicate a measurable decline of recent ChatGPT releases’ ability to produce varied text, even when explicitly prompted to do so, by setting the temperature parameter to one. The observed reduction in output diversity may be attributed to the influence of the amounts of synthetic data incorporated within their training datasets as the result of internet infiltration by LLM generated data. The phenomenon is defined as model self-convergence because of the gradual increase of similarities of produced texts among different ChatGPT versions.
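One simple way to quantify output diversity with a text similarity metric is the mean pairwise similarity of several generations for the same prompt. The abstract does not name the metric used, so the sketch below assumes word-level Jaccard similarity purely for illustration; the function names and sample texts are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-level Jaccard similarity between two texts (assumed metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(texts):
    """Average similarity over all unordered pairs of texts.
    Higher values mean the outputs are less diverse."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up generations standing in for outputs of two model versions.
varied = ["the cat sat", "a dog ran", "birds fly high"]
similar = ["the cat sat", "the cat sat", "the cat slept"]
score_varied = mean_pairwise_similarity(varied)
score_similar = mean_pairwise_similarity(similar)
```

Comparing this score across model releases, with the prompt and temperature held fixed, gives a rough version of the longitudinal comparison the abstract describes.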

Keywords: Large Language Models, ChatGPT, model self-convergence, synthetic data, recursive training, output diversity, text similarity, model collapse

88. ❌ MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

Authors: Shuxin Liu, Ou Wu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12677v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

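Scoring rationale: The paper's core subject is knowledge editing for large language models, which directly matches the 'Large Language Models' keyword (10). The paper mentions 'alignment', but it is mainly concerned with aligning the edit target with the model's feasible region rather than with value alignment in instruction tuning, so 'Instruction Tuning OR Alignment OR Value Alignment' receives 5. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, Quantization, and AI for Science, are not mentioned in the title or abstract and are unrelated to the paper's content, so they score 0.

!!! tip deepseek-chat TL;DR

To address the semantic-execution disconnect in knowledge editing for large language models, the paper proposes MetaKE, a framework that uses bi-level optimization to treat the edit target as a learnable meta-parameter, automatically aligning the edit direction with the model's feasible manifold and significantly improving editing performance.

Abstract (Chinese translation)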

知识编辑（Knowledge Editing，KE）旨在精确修正大语言模型（LLM）中的特定知识，同时不破坏其通用能力。现有先进方法存在开环控制失配问题。我们发现了一个关键的“语义-执行脱节”现象：语义目标的设定独立于下游可行域的反馈。这种错位常导致有效的语义目标落入禁止空间，引发梯度截断与编辑失败。为弥合这一差距，我们提出MetaKE（元学习对齐知识编辑），这是一个将KE重新构建为双层优化问题的新框架。MetaKE摒弃静态计算方式，将编辑目标视为可学习的元参数：上层优化器寻找可行目标以最大化编辑后性能，而下层求解器执行编辑操作。针对复杂求解器难以微分的问题，我们推导出一种结构梯度代理，显式地将可编辑性约束反向传播至目标学习阶段。理论分析表明，MetaKE能自动将编辑方向与模型的可行流形对齐。大量实验证实，MetaKE显著优于现有强基线方法，为知识编辑研究提供了新视角。

Abstract

Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs) without disrupting general capabilities. State-of-the-art methods suffer from an open-loop control mismatch. We identify a critical “Semantic-Execution Disconnect”: the semantic target is derived independently without feedback from the downstream’s feasible region. This misalignment often causes valid semantic targets to fall within the prohibited space, resulting in gradient truncation and editing failure. To bridge this gap, we propose MetaKE (Meta-learning Aligned Knowledge Editing), a new framework that reframes KE as a bi-level optimization problem. Departing from static calculation, MetaKE treats the edit target as a learnable meta-parameter: the upper-level optimizer seeks a feasible target to maximize post-edit performance, while the lower-level solver executes the editing. To address the challenge of differentiating through complex solvers, we derive a Structural Gradient Proxy, which explicitly backpropagates editability constraints to the target learning phase. Theoretical analysis demonstrates that MetaKE automatically aligns the edit direction with the model’s feasible manifold. Extensive experiments confirm that MetaKE significantly outperforms strong baselines, offering a new perspective on knowledge editing.
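In generic notation, the bi-level structure described in the abstract can be written as below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $z$ denotes the learnable edit target, $\theta$ the model parameters, $\mathcal{L}_{\text{edit}}$ the objective minimized by the lower-level editing solver, and $\mathcal{J}$ the post-edit performance that the upper level maximizes.

$$
\max_{z}\ \mathcal{J}\bigl(\theta^{*}(z)\bigr)
\quad \text{subject to} \quad
\theta^{*}(z) \in \arg\min_{\theta}\ \mathcal{L}_{\text{edit}}(\theta;\, z)
$$

Updating $z$ by gradient ascent requires $\partial \theta^{*} / \partial z$, that is, differentiating through the solver. The abstract states that MetaKE avoids this with a Structural Gradient Proxy; its exact form is not given there.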

Keywords: Knowledge Editing, Large Language Models, Bi-level Optimization, Meta-learning, Semantic-Execution Disconnect, Feasible Manifold, Structural Gradient Proxy, Editing Performance

89. ❌ Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

Authors: Haohang Huang, Jiayi Luo, Issam Qamhia, Erol Tutumluer, John M. Hart, Andrew J. Stolba Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12667v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

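Scoring rationale: The paper studies a marker-based 3D reconstruction method for aggregates and the associated morphological analysis, at the intersection of civil engineering and computer vision. The keywords all concern large models, deep learning principles, or AI applications in science, but the paper involves none of these: it uses only traditional computer vision and photogrammetry for 3D reconstruction, with no machine learning, deep learning, or large-model methods. The only marginally related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper addresses a scientific application (analysis of civil engineering materials); since it does not use AI techniques, the keyword receives only 5 (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a marker-based, cost-effective photogrammetry method for 3D reconstruction of aggregate particles, validates its accuracy against ground-truth data, and compares 2D and 3D morphological properties, finding significant differences between the two.

Abstract (Chinese translation)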

骨料作为建筑材料组合中的主要骨架，是各类建筑与交通基础设施的重要功能组分。其既可用于非粘结层（如路面基层和铁路道砟），也可用于水泥混凝土和沥青混凝土等粘结性应用，还可作为抛石料和大尺寸初级破碎石料。获取骨料的尺寸、形状或形态信息，能深入揭示其在配比与堆积过程中的行为特征，从而显著提升质量保证/质量控制（QA/QC）流程的效率。然而，无论是在采石场生产阶段还是在施工现场，对骨料颗粒形态进行完整的三维表征均存在困难。目前已有多种骨料成像方法通过计算机视觉技术量化颗粒形态，包括基于二维图像分析颗粒轮廓的方法，以及依赖三维激光扫描仪或X射线计算机断层扫描（CT）设备等昂贵仪器的三维扫描方法。本文提出一种灵活且经济高效的基于摄影测量的骨料颗粒三维重建方法。该方法采用基于标记点的设计，能够实现背景抑制、点云拼接和尺度参照，从而获取高质量的骨料模型。针对选定的骨料样本，通过真值数据验证了重建结果的准确性。并对所选样本的二维与三维形态特性进行了对比分析，发现二者统计特征存在显著差异。基于所提出的方法，可便捷、低成本地获取骨料的三维形状信息，从而实现便利的骨料检测、数据采集和三维形态分析。

Abstract

Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g. pavement base and railroad ballast, bound applications of cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape or morphology of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights of aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground-truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics. Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.

Keywords: 3D reconstruction, aggregate particles, photogrammetry, marker-based design, morphological analysis, computer vision, construction materials, quality assurance

90. ❌ RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

Authors: Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12666v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

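Scoring rationale: The paper proposes RetroReasoner, a reasoning large language model for retrosynthesis prediction; its core is the application of LLMs to cheminformatics (AI for Science). The model is trained with supervised fine-tuning (SFT) and reinforcement learning (RL) and emphasizes strategic reasoning (Chain of Thought/System 2 Thinking), so these keywords are highly relevant (10). Other keywords such as MoE, SLMs, and Scaling Laws are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Targeting retrosynthesis prediction in organic synthesis, the paper proposes RetroReasoner, a model trained for strategic reasoning with a combination of supervised fine-tuning and reinforcement learning; experiments show that it outperforms existing baselines and generates a broader range of feasible reactant proposals.

Abstract (Chinese translation)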

逆合成预测是有机合成领域的核心任务，其目标是为给定的产物分子预测反应物。传统上，化学家需要选择合理的键断裂位点并推导出相应的反应物，这一过程耗时且需要深厚的专业知识。尽管近年来分子大语言模型（LLMs）已取得进展，但许多方法要么缺乏策略性推理而直接预测反应物，要么仅进行通用的产物分析，而非明确推理那些逻辑上导向特定反应物选择的键断裂策略。为克服这些局限，我们提出了RetroReasoner——一种借鉴化学家策略性思维的逆合成推理模型。RetroReasoner通过监督微调（SFT）和强化学习（RL）进行训练。在SFT阶段，我们引入了SyntheticRetro框架，该框架能生成结构化的断裂原理说明并同步预测反应物。在RL阶段，我们采用往返准确性作为奖励机制：将预测的反应物输入正向合成模型，若正向预测的产物与原始输入产物一致，则对预测给予奖励。实验结果表明，RetroReasoner不仅优于现有基线模型，还能生成更广泛的可行反应物方案，尤其在处理更具挑战性的反应实例时表现突出。

Abstract

Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists’ strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.
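The round-trip reward described in the abstract can be sketched as a small function. This is an illustrative sketch under stated assumptions: the reward is taken to be binary, the forward synthesis model is any callable from a reactant string to a product string, and `canonical` is a placeholder normalisation (a real pipeline would canonicalise SMILES with a cheminformatics toolkit). All names and the lookup-table forward model are hypothetical.

```python
from typing import Callable, Sequence


def canonical(smiles: str) -> str:
    """Placeholder normalisation: sort the dot-separated components.
    This does NOT canonicalise individual molecules."""
    return ".".join(sorted(part.strip() for part in smiles.split(".")))


def round_trip_reward(
    product: str,
    predicted_reactants: Sequence[str],
    forward_model: Callable[[str], str],
) -> float:
    """Return 1.0 if the forward model maps the predicted reactants back
    to the original product, else 0.0 (binary reward is an assumption)."""
    reactant_smiles = ".".join(predicted_reactants)
    recovered = forward_model(reactant_smiles)
    return 1.0 if canonical(recovered) == canonical(product) else 0.0


# Toy forward model: a lookup table with one esterification
# (acetic acid + ethanol -> ethyl acetate).
KNOWN_REACTIONS = {canonical("CC(=O)O.OCC"): "CC(=O)OCC"}


def toy_forward_model(reactants: str) -> str:
    return KNOWN_REACTIONS.get(canonical(reactants), "")


hit = round_trip_reward("CC(=O)OCC", ["OCC", "CC(=O)O"], toy_forward_model)
miss = round_trip_reward("CC(=O)OCC", ["OCC"], toy_forward_model)
```

Because the reward depends only on the forward model's output, any reactant set that regenerates the product is rewarded, not just the one recorded in the dataset, which is consistent with the abstract's claim of a broader range of feasible proposals.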

Keywords: Retrosynthesis prediction, Large language models, Strategic reasoning, Supervised fine-tuning, Reinforcement learning, Chemical synthesis, Molecular LLMs, Reactant prediction

91. ❌ From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

Authors: Lehui Li, Yuyao Wang, Jisheng Yan, Wei Zhang, Jinliang Deng, Haoliang Sun, Zhongyi Han, Yongshun Gong Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12664v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

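Scoring rationale: The paper's core idea is to use an LLM to extract structured temporal semantic features from text for time-series forecasting, an applied innovation of large models in a scientific domain. It is highly relevant to 'Large Language Models' (10) because the LLM is the core component of the method; somewhat related to 'Mechanistic Interpretability' (5) because the paper mentions interpretable temporal primitives; and somewhat related to 'AI for Science' (5) because it applies AI to time-series forecasting as a scientific problem. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes TESS, which bridges the modality gap by using a large language model to extract interpretable temporal semantic features from text, and achieves up to a 29% reduction in forecasting error over existing methods on four real-world datasets.

Abstract (Chinese translation)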

将文本信息融入时间序列预测有望解决事件驱动的非平稳性问题；然而，根本的模态差异阻碍了有效融合：文本描述以隐性和定性的方式表达时序影响，而预测模型依赖于显性和定量的信号。通过受控的半合成实验，我们发现现有方法过度关注冗余标记，且难以可靠地将文本语义转化为可用的数值线索。为弥合这一差距，我们提出了TESS方法，该方法引入了一个时序演化语义空间作为模态间的中间瓶颈层。该空间由可解释、基于数值的时序基元（均值偏移、波动性、形态和滞后）构成，这些基元通过结构化提示由大语言模型从文本中提取，并经由置信感知门控机制进行筛选。在四个真实世界数据集上的实验表明，相较于最先进的单模态与多模态基线方法，预测误差最高可降低29%。代码将在论文录用后公开。

Abstract

Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.
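The idea of a small, numerically grounded bottleneck with confidence-aware gating can be illustrated as follows. This is a hypothetical sketch, not the TESS implementation: the abstract names the four primitives but not their encoding, so the field types, the hard 0.5 threshold, and the `apply_mean_shift` helper (which nudges a baseline forecast directly instead of feeding a forecasting model) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalPrimitives:
    """Primitives an LLM might extract from one text item (assumed schema)."""
    mean_shift: float   # expected change in level
    volatility: float   # expected change in variability
    shape: str          # qualitative profile, e.g. "step" or "ramp"
    lag: int            # steps until the effect starts
    confidence: float   # extraction confidence in [0, 1]


# Neutral element used when an extraction is rejected by the gate.
NEUTRAL = TemporalPrimitives(0.0, 0.0, "none", 0, 0.0)


def gate(primitives, threshold=0.5):
    """Hard confidence gate (assumption): low-confidence extractions are
    replaced by the neutral element so they cannot affect the forecast."""
    return primitives if primitives.confidence >= threshold else NEUTRAL


def apply_mean_shift(baseline, primitives):
    """Toy consumer of the primitives: add the mean shift to every step
    at or after the lag. Only `mean_shift` and `lag` are used here."""
    return [
        y + primitives.mean_shift if t >= primitives.lag else y
        for t, y in enumerate(baseline)
    ]


baseline = [10.0, 10.0, 10.0, 10.0]
trusted = TemporalPrimitives(2.0, 0.1, "step", 1, 0.9)
doubtful = TemporalPrimitives(5.0, 0.3, "step", 0, 0.2)
adjusted = apply_mean_shift(baseline, gate(trusted))
unchanged = apply_mean_shift(baseline, gate(doubtful))
```

The point of the bottleneck is that the forecaster sees only a handful of explicit numbers per text item, rather than raw tokens, which is how the abstract describes turning qualitative text into quantitative signals.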

Keywords: time-series forecasting, multimodal fusion, large language models, temporal semantics, modality gap, textual information, structured prompting, interpretable primitives

92. ❌ LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

Authors: Ziyu Chen, Fan Zhu, Hui Zhu, Deyi Kong, Xinkai Kuang, Yujia Zhang, Chunmao Jiang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12647v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

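Scoring rationale: The paper focuses on applying 3D Gaussian Splatting to self-driving scene reconstruction, improving reconstruction quality by fusing LiDAR reflectance with RGB data. The scoring keywords all concern large language models (LLMs) and related techniques (such as training, alignment, inference optimization, and agent systems), whereas this paper presents a specific method in computer vision and 3D reconstruction. It does not involve language models, deep learning foundation models, or the bioinformatics and cheminformatics applications of AI for Science, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a robust LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) to address reconstruction degradation caused by high ego-motion and complex lighting in self-driving scenes, achieving better reconstruction performance with fewer Gaussians and shorter training time on the Waymo Open Dataset.

Abstract (Chinese translation)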

近期基于三维高斯泼溅（3D Gaussian Splatting, 3DGS）的方法已证明了自动驾驶场景重建与新视角合成的可行性。然而，现有方法大多仅依赖相机，或仅使用激光雷达（LiDAR）进行高斯初始化或深度监督，而点云所蕴含的丰富场景信息（如反射率）以及激光雷达与RGB数据之间的互补性尚未得到充分利用，这导致在具有高自车运动和复杂光照等挑战性的自动驾驶场景中性能下降。为解决这些问题，我们提出一种面向自动驾驶场景的、鲁棒且高效的激光雷达反射率引导显著高斯泼溅方法（LR-SGS）。该方法引入一种结构感知的显著高斯表示，该表示从激光雷达提取的几何与反射率特征点初始化，并通过显著变换和改进的密度控制进行优化，以捕捉边缘与平面结构。此外，我们将激光雷达强度校准为反射率，并将其作为光照不变的材料通道附加到每个高斯点上，与RGB数据共同对齐以增强边界一致性。在Waymo开放数据集上的大量实验表明，LR-SGS能够以更少的高斯点和更短的训练时间实现优越的重建性能。尤其在复杂光照场景中，本方法在PSNR指标上超越OmniRe达1.18 dB。

摘要 (Abstract)

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

Keywords: 3D Gaussian Splatting, LiDAR reflectance, self-driving scene reconstruction, novel view synthesis, salient Gaussian representation, Waymo Open Dataset, complex lighting, PSNR improvement

93. ❌ LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

Authors: Jiawei Hao, Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Dan Zeng Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
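
The weight, relevance, and score columns above can be read as a weighted sum that is compared against the pass mark. The following is a minimal sketch under that assumption: the per-row rule `score = weight * relevance` and the helper names `pass_mark` and `total_score` are illustrative and are not taken from the report generator. The pass-mark formula (5.0 + 0.8 × total weight) and the configured weight of 1.0 per keyword come from the report's configuration section.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance values from the LightMoE table above.
relevances = [10.0, 15.0, 10.0, 8.0, 5.0]

# Weights exactly as printed in the table (0.0 on every row).
as_printed = [(0.0, r) for r in relevances]

# Weights as listed in the report configuration (1.0 per keyword).
as_configured = [(1.0, r) for r in relevances]

threshold = pass_mark(total_weight=27.0)

print(round(threshold, 1))
print(total_score(as_printed))
print(total_score(as_configured))
```

Under this reading, the weights as printed (0.0) give a total of 0.0, while the configured weight of 1.0 would give 48.0 for the same relevance values, which is above the 26.6 pass mark.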

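Scoring rationale: LightMoE addresses expert compression for MoE-based LLMs; its core contribution is the expert replacing paradigm for reducing redundant expert modules. The keywords it matches are: 1) 'Mixture of Experts OR MoE OR Sparse Models' (15 points): the paper studies MoE models directly, and this is its core topic; 2) 'Large Language Models OR LLMs OR Foundation Models' (10 points): the paper explicitly targets MoE-based LLMs; 3) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points): the paper replaces experts with parameter-efficient modules and compares performance against LoRA; 4) 'Quantization OR Model Compression OR Low-bit Weights' (8 points): the paper falls under model compression and aims to reduce memory requirements; 5) 'Model Merging OR Model Soups OR Weight Averaging' (5 points): the work is related to, but distinct from, existing expert compression techniques such as merging. The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the high memory demands caused by redundant expert modules in MoE-based LLMs, the paper proposes the LightMoE framework, which combines expert replacing with parameter-efficient recovery. It matches LoRA fine-tuning at a 30% compression ratio and outperforms existing methods at 50%.

Abstract translation (Chinese)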

基于专家混合（Mixture-of-Experts, MoE）的大型语言模型（LLMs）已展现出卓越的性能和计算效率。然而，其部署常受限于巨大的内存需求，这主要源于需要加载大量专家模块。尽管现有的专家压缩技术（如剪枝或合并）试图缓解这一问题，但它们往往存在不可逆的知识损失或高昂的训练开销。本文提出了一种新颖的专家压缩范式，称为专家替换（expert replacing），该范式使用参数高效的模块替换冗余专家，并以较低的训练成本恢复其能力。我们发现，即使采用该范式的简单基线方法也能获得有前景的性能。在此基础上，我们提出了LightMoE框架，该框架通过引入自适应专家选择、分层专家构建以及退火恢复策略来增强该范式。实验结果表明，在30%的压缩率下，LightMoE的性能与LoRA微调相当。即使在更激进的50%压缩率下，它也优于现有方法，并在五项不同任务中实现了平均5.6%的性能提升。这些发现表明，LightMoE在内存效率、训练效率和模型性能之间实现了更优的平衡。

摘要 (Abstract)

Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their deployment is often constrained by substantial memory demands, primarily due to the need to load numerous expert modules. While existing expert compression techniques like pruning or merging attempt to mitigate this, they often suffer from irreversible knowledge loss or high training overhead. In this paper, we propose a novel expert compression paradigm termed expert replacing, which replaces redundant experts with parameter-efficient modules and recovers their capabilities with low training costs. We find that even a straightforward baseline of this paradigm yields promising performance. Building on this foundation, we introduce LightMoE, a framework that enhances the paradigm by introducing adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results show that LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio. Even under a more aggressive 50% compression rate, it outperforms existing methods and achieves average performance improvements of 5.6% across five diverse tasks. These findings demonstrate that LightMoE strikes a superior balance among memory efficiency, training efficiency, and model performance.

Keywords: Mixture-of-Experts, Large Language Models, expert compression, parameter-efficient, memory efficiency, training efficiency, LightMoE, expert replacing

94. ❌ Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents

Authors: Yushu Li, Wenlong Deng, Jiajin Li, Xiaoxiao Li Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	8.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

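Scoring rationale: The paper's core topic is reasoning optimization for LLM agents under budget constraints, so it is highly relevant to LLMs, multi-step reasoning, System 2 thinking, agents, and tool use (10 points each). It uses a tree search method, which relates to MCTS (8 points), and it involves improving self-evaluation, which relates to self-correction (5 points). Other keywords such as MoE, quantization, and AI for Science are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a training-free Budget-Aware Value Tree (BAVT) search framework that addresses the inefficiency of LLM agents performing multi-step reasoning with limited compute. Experiments show that under strict low-budget constraints it outperforms baselines that use 4× the resources.

Abstract translation (Chinese)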

测试时扩展已成为提升大语言模型智能体可靠性的主流范式，但现有方法将计算视为无限资源，允许智能体在冗余步骤或无效路径上耗尽令牌与工具调用预算。当前已有的预算感知方法要么需要昂贵的微调，要么依赖于粗粒度的轨迹级启发式策略，无法在执行过程中进行干预。我们提出预算感知价值树（Budget-Aware Value Tree, BAVT），一种无需训练、在推理时运行的框架，该框架在单一大语言模型主干内，通过步骤级价值估计引导动态搜索树，对多跳推理过程进行建模。另一项关键创新是预算条件节点选择机制，该机制将剩余资源比例作为节点价值的自然缩放指数，从而在预算消耗过程中，提供一种无需参数、原则性的从广泛探索到贪婪利用的过渡策略。针对大语言模型自我评估中已知的过度自信问题，BAVT采用残差价值预测器，对相对进展而非绝对状态质量进行评分，从而可靠地剪除无信息量或冗余的工具调用。我们进一步提供了理论收敛性保证，证明在明确的有限预算约束下，BAVT以至少 $1-ε$ 的概率达到最终答案。在两个模型家族的四个多跳问答基准上的广泛评估表明，BAVT持续优于并行采样基线方法。最显著的是，在严格的低预算约束下，BAVT的性能超越了基线方法在 $4\times$ 资源分配下的表现，这证明智能预算管理从根本上优于暴力计算扩展。

摘要 (Abstract)

Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, allowing agents to exhaust token and tool budgets on redundant steps or dead-end trajectories. Existing budget-aware methods either require expensive fine-tuning or rely on coarse, trajectory-level heuristics that cannot intervene mid-execution. We propose the Budget-Aware Value Tree (BAVT), a training-free inference-time framework that models multi-hop reasoning as a dynamic search tree guided by step-level value estimation within a single LLM backbone. Another key innovation is a budget-conditioned node selection mechanism that uses the remaining resource ratio as a natural scaling exponent over node values, providing a principled, parameter-free transition from broad exploration to greedy exploitation as the budget depletes. To combat the well-known overconfidence of LLM self-evaluation, BAVT employs a residual value predictor that scores relative progress rather than absolute state quality, enabling reliable pruning of uninformative or redundant tool calls. We further provide a theoretical convergence guarantee, proving that BAVT reaches a terminal answer with probability at least $1-ε$ under an explicit finite budget bound. Extensive evaluations on four multi-hop QA benchmarks across two model families demonstrate that BAVT consistently outperforms parallel sampling baselines. Most notably, BAVT under strict low-budget constraints surpasses baseline performance at $4\times$ the resource allocation, establishing that intelligent budget management fundamentally outperforms brute-force compute scaling.
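
The budget-conditioned node selection described above can be illustrated with a small sketch. The abstract only states that the remaining resource ratio acts as a scaling exponent over node values; the specific form used here (sampling weights proportional to `value ** (1 / remaining_ratio)`) and the function name `selection_probs` are assumptions for illustration, not the paper's definition.

```python
def selection_probs(values, remaining_ratio):
    """Turn node values into selection probabilities that sharpen as budget runs out.

    Assumed form: weight_i = (value_i / max_value) ** (1 / remaining_ratio).
    remaining_ratio = 1.0 gives probabilities proportional to the values;
    as remaining_ratio approaches 0 the mass concentrates on the best node.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 < remaining_ratio <= 1.0:
        raise ValueError("remaining_ratio must be in (0, 1]")

    best = max(values)
    if best <= 0.0:
        # No informative values: fall back to a uniform choice.
        return [1.0 / len(values)] * len(values)

    exponent = 1.0 / remaining_ratio
    # Dividing by the best value keeps every base in [0, 1], so large
    # exponents shrink weights toward 0 instead of overflowing.
    weights = [(max(v, 0.0) / best) ** exponent for v in values]
    total = sum(weights)
    return [w / total for w in weights]


node_values = [0.2, 0.3, 0.5]
full_budget = selection_probs(node_values, remaining_ratio=1.0)
low_budget = selection_probs(node_values, remaining_ratio=0.1)
print(full_budget)
print(low_budget)
```

With the full budget the three nodes are sampled in proportion to their values; with 10% of the budget left almost all of the probability falls on the highest-valued node, which matches the exploration-to-exploitation transition the abstract describes.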

Keywords: LLM agents, budget-aware, value tree search, multi-hop reasoning, tool use, inference-time optimization, self-evaluation, computational efficiency

95. ❌ The Economics of AI Supply Chain Regulation

Authors: Sihan Qian, Amit Mehra, Dengpan Liu Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12630v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

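Scoring rationale: The paper studies economic regulation of AI supply chains driven by foundation models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It mentions downstream firms fine-tuning models with proprietary data, which has some connection to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points). It does not cover other LLM technical details, application domains, or AI for Science applications, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Using a game-theoretic model, the study analyzes how different regulatory policies aimed at foundation model providers and downstream firms in the AI supply chain (promoting price competition, promoting quality competition, or subsidizing compute) affect consumer surplus and firm profits. It finds that policies promoting quality competition always raise consumer surplus, while the effect of the other policies depends on factors such as compute costs.

Abstract translation (Chinese)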

基础模型的兴起推动了人工智能供应链的出现，其中上游基础模型提供商为开发领域特定应用的下游企业提供微调与推理服务。下游企业通过向提供商付费，利用其计算基础设施结合专有数据对模型进行微调，形成了一种提升模型质量的共创动态。在基础模型提供商与下游企业可能攫取过多消费者剩余引发担忧、且监管措施日益增多的背景下，本研究采用博弈论模型，分析包含一个提供商与两个竞争性下游企业的场景中，政策干预如何影响人工智能供应链中的消费者剩余。我们的分析表明，促进下游市场价格竞争的政策（即亲价格竞争政策）仅在计算或数据预处理成本较高时能提升消费者剩余，而计算补贴政策仅在成本较低时有效，这提示两类政策具有互补性。相比之下，促进下游市场质量竞争的政策（即亲质量竞争政策）总能提升消费者剩余。研究还发现，在亲价格竞争政策或计算补贴下，提供商与下游企业均可实现更高利润，同时消费者剩余也增加，形成三方共赢局面。然而，亲质量竞争政策虽能提高提供商利润，却会降低下游企业利润。最后，随着计算成本下降，亲价格竞争政策可能失效，而计算补贴政策可能从无效转为有效。这些发现为政策制定者构建兼具经济效率与社会效益的人工智能供应链提供了参考。

摘要 (Abstract)

The rise of foundation models has driven the emergence of AI supply chains, where upstream foundation model providers offer fine-tuning and inference services to downstream firms developing domain-specific applications. Downstream firms pay providers to use their computing infrastructure to fine-tune models with proprietary data, creating a co-creation dynamic that enhances model quality. Amid concerns that foundation model providers and downstream firms may capture excessive consumer surplus, along with increasing regulatory measures, this study employs a game-theoretic model involving a provider and two competing downstream firms to analyze how policy interventions affect consumer surplus in the AI supply chain. Our analysis shows that policies promoting price competition in downstream markets (i.e., pro-price-competitive policies) boost consumer surplus only when compute or data preprocessing costs are high, while compute subsidies are effective only when these costs are low, suggesting these policies complement each other. In contrast, policies promoting quality competition in downstream markets (i.e., pro-quality-competitive policies) always improve consumer surplus. We also find that under pro-price-competitive policies or compute subsidies, both the provider and downstream firms can achieve higher profits along with greater consumer surplus, creating a win-win-win outcome. However, pro-quality-competitive policies increase the provider’s profits while reducing those of downstream firms. Finally, as compute costs decline, pro-price-competitive policies may lose their effectiveness, whereas compute subsidies may shift from ineffective to effective. These findings offer insights for policymakers seeking to foster AI supply chains that are economically efficient and socially beneficial.

Keywords: AI supply chain, foundation models, game-theoretic model, policy interventions, consumer surplus, fine-tuning, compute costs, regulatory measures

96. ❌ Towards unified brain-to-text decoding across speech production and perception

Authors: Zhizhang Yuan, Yang Yang, Gaorui Zhang, Baowen Cheng, Zehan Wu, Yuhao Xu, Xiaoying Liu, Liang Chen, Ying Mao, Meng Li Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12628v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

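Scoring rationale: The paper proposes a unified brain-to-text decoding framework for Mandarin speech production and perception. Its core innovation is using a post-trained large language model (LLM) to map sequences of toneless Pinyin syllables to Chinese sentences, with a three-stage post-training and two-stage inference framework built on a 7-billion-parameter LLM. The work is an application of AI for Science in neuroscience and brain-computer interfaces. It is therefore highly relevant (10 points each) to 'Large Language Models OR LLMs OR Foundation Models' (an LLM is central to the method), 'Post-training OR Supervised Fine-tuning OR SFT' (the post-training framework is described in detail), and 'AI for Science OR Bioinformatics OR Cheminformatics' (an AI application in neuroscience). Other keywords such as MoE, Scaling Laws, RLHF, RAG, and Agents are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes a unified brain-to-sentence decoding framework for Mandarin speech production and perception that decodes neural signals into Chinese sentences with a post-trained large language model, and it reveals differences in neural dynamics between the two modalities.

Abstract translation (Chinese)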

言语产生与感知是人类日常交流的主要方式。既往的脑到文本解码研究大多集中于单一模态及拼音文字语言。本文提出了一种适用于汉语普通话言语产生与感知的统一脑到句子解码框架。该框架展现出强大的泛化能力，仅使用单字数据训练即可实现句子级解码，并支持解码训练中未出现的汉字与音节。此外，该框架支持对跨模态神经动力学进行直接且可控的比较。普通话言语的解码首先从神经信号中分类汉语拼音的音节成分（即声母和韵母），随后通过一个后训练的大语言模型将无声调的拼音音节序列映射为中文句子。为提升大语言模型解码效果，我们基于一个70亿参数的大语言模型设计了三阶段后训练与两阶段推理框架，其整体性能超越了参数量达数千亿乃至更大的商用大语言模型。此外，研究观察到普通话言语产生与感知的若干特征：言语产生比听觉感知涉及更广泛的皮层区域神经响应；对两种模态均有响应的通道表现出相似的活动模式，而言语感知相对于产生存在时间延迟；解码性能在两半球间大致相当。我们的工作不仅验证了统一解码框架的可行性，还为理解普通话言语产生与感知的神经特征提供了见解。这些进展推动了语素音节文字的脑到文本解码研究，并为支持多模态的神经语言解码系统的发展铺平了道路。

摘要 (Abstract)

Speech production and perception are the main ways humans communicate daily. Prior brain-to-text decoding studies have largely focused on a single modality and alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during training. In addition, it allows direct and controlled comparison of neural dynamics across modalities. Mandarin speech is decoded by first classifying syllable components in Hanyu Pinyin, namely initials and finals, from neural signals, followed by a post-trained large language model (LLM) that maps sequences of toneless Pinyin syllables to Chinese sentences. To enhance LLM decoding, we designed a three-stage post-training and two-stage inference framework based on a 7-billion-parameter LLM, achieving overall performance that exceeds larger commercial LLMs with hundreds of billions of parameters or more. In addition, several characteristics were observed in Mandarin speech production and perception: speech production involved neural responses across broader cortical regions than auditory perception; channels responsive to both modalities exhibited similar activity patterns, with speech perception showing a temporal delay relative to production; and decoding performance was broadly comparable across hemispheres. Our work not only establishes the feasibility of a unified decoding framework but also provides insights into the neural characteristics of Mandarin speech production and perception. These advances contribute to brain-to-text decoding in logosyllabic languages and pave the way toward neural language decoding systems supporting multiple modalities.

Keywords: brain-to-text decoding, speech production, speech perception, large language model (LLM), post-training, Mandarin Chinese, neural dynamics, unified decoding framework

97. ❌ VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

Authors: Ty Valencia, Burak Barlas, Varun Singhal, Ruchir Bhatia, Wei Yang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

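Scoring rationale: VLM4Rec proposes a multimodal recommendation framework based on a large vision-language model (LVLM). Its core idea is to use the LVLM to convert item images into natural-language descriptions, followed by semantic alignment and representation learning. The work is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because LVLMs belong to the large-model family and the paper explicitly uses one. The paper does not cover or mention other keywords such as MoE, SLMs, Scaling Laws, the various training techniques (pre-training, fine-tuning, alignment, and so on), inference optimization, agent systems, or model compression, so they score 0. The paper is an application of large models to recommender systems, which matches the 'research applications of large models in different domains' requirement in the research background.

!!! tip deepseek-chat TL;DR

The paper proposes VLM4Rec, a multimodal recommendation framework based on a large vision-language model that converts item images into semantic descriptions and builds aligned representations. It consistently improves recommendation performance over raw visual features and several fusion-based alternatives, suggesting that the quality of the semantic representation may matter more than the complexity of feature fusion.

Abstract translation (Chinese)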

多模态推荐通常被构建为特征融合问题，即结合文本与视觉信号以更好地建模用户偏好。然而，多模态推荐的有效性可能不仅取决于模态融合的方式，还取决于项目内容是否在与偏好匹配对齐的语义空间中表示。这一问题尤为重要，因为原始视觉特征往往保留外观相似性，而用户决策通常由更高层次的语义因素（如风格、材质和使用情境）驱动。基于此观察，我们提出了基于大视觉语言模型的多模态语义表征推荐框架（VLM4Rec），这是一个通过语义对齐而非直接特征融合来组织多模态项目内容的轻量级框架。VLM4Rec首先使用大视觉语言模型将每个项目图像锚定到显式的自然语言描述中，随后将锚定语义编码为稠密的项目表征，以进行面向偏好的检索。推荐过程随后通过基于历史项目嵌入的简单档案语义匹配机制实现，形成了实用的离线-在线分解架构。在多个多模态推荐数据集上的大量实验表明，VLM4Rec相较于原始视觉特征及多种基于融合的替代方法，持续提升了推荐性能，这表明在此场景下表征质量可能比融合复杂度更为关键。代码发布于https://github.com/tyvalencia/enhancing-mm-rec-sys。

摘要 (Abstract)

Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at https://github.com/tyvalencia/enhancing-mm-rec-sys.
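
The profile-based matching step described above can be sketched in a few lines. The abstract does not specify how the profile is built or scored, so mean pooling of the history embeddings, cosine similarity, and the names `mean_vector`, `cosine`, and `recommend` are assumptions for illustration. The toy vectors stand in for encoded item descriptions and are not data from the paper.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (assumed user-profile pooling)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(history_embeddings, candidate_embeddings, k):
    """Rank candidate items by similarity to the mean of the user's history."""
    profile = mean_vector(history_embeddings)
    ranked = sorted(
        candidate_embeddings,
        key=lambda item: cosine(profile, candidate_embeddings[item]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings (hypothetical values, for illustration only).
history = [[1.0, 0.0], [0.8, 0.2]]
candidates = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
top_two = recommend(history, candidates, k=2)
print(top_two)
```

Because the item embeddings can be computed ahead of time, only the pooling and similarity ranking need to run per request, which is one way to read the offline-online decomposition mentioned in the abstract.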

Keywords: multimodal recommendation, large vision-language models, semantic representation, feature fusion, preference matching, offline-online decomposition, semantic alignment, item representation

98. ❌ When Drafts Evolve: Speculative Decoding Meets Online Learning

Authors: Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

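Scoring rationale: The paper's core topic is speculative decoding, a key method for accelerating LLM inference, so it is highly relevant to 'Speculative Decoding OR Inference Acceleration' (15 points). It explicitly studies inference acceleration for large language models (LLMs), so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords such as MoE, SLMs, alignment, RAG, and CoT are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the OnlineSpec framework, which uses online learning to continually refine the draft model from the verification feedback produced during speculative decoding, achieving up to 24% speedup across seven benchmarks and three foundation models.

Abstract translation (Chinese)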

推测解码已成为加速大语言模型推理的广泛采用范式，其通过一个轻量级草稿模型快速生成候选标记，随后由更大的目标模型并行验证。然而，由于模型能力有限，草稿模型往往难以逼近目标分布，导致接受长度缩短和加速效果下降。一个关键但尚未被充分探讨的观察是：推测解码本质上提供了验证反馈，该反馈量化了草稿模型与目标模型之间的偏差，且无需额外成本。这一过程自然形成了一个迭代的“草稿提交-反馈提供-草稿适应”演化循环，恰好与在线学习范式相匹配。基于这一关联，我们提出了OnlineSpec——一个统一框架，系统性地利用交互反馈持续演化草稿模型。依托于动态遗憾最小化理论，我们建立了在线学习性能与推测系统加速率之间的形式化联系，并通过现代在线学习技术开发了新颖算法，包括自适应复用历史梯度作为预测更新提示的乐观在线学习，以及动态维护多个草稿模型的在线集成学习。我们的算法具备理论依据和更高的加速率，在七个基准测试和三个基础模型上实现了最高24%的加速效果。

摘要 (Abstract)

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model. However, due to limited model capacity, drafts often struggle to approximate the target distribution, resulting in shorter acceptance lengths and diminished speedup. A key yet under-explored observation is that speculative decoding inherently provides verification feedback that quantifies the deviation between the draft and target models at no additional cost. This process naturally forms an iterative “draft commits-feedback provides-draft adapts” evolving loop, which precisely matches the online learning paradigm. Motivated by this connection, we propose OnlineSpec, a unified framework that systematically leverages interactive feedback to continuously evolve draft models. Grounded in dynamic regret minimization, we establish a formal link between online learning performance and speculative system’s acceleration rate, and develop novel algorithms via modern online learning techniques, including optimistic online learning that adaptively reuses historical gradients as predictive update hints, and online ensemble learning that dynamically maintains multiple draft models. Our algorithms are equipped with theoretical justifications and improved acceleration rates, achieving up to 24% speedup over seven benchmarks and three foundation models.
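
For readers unfamiliar with the draft-then-verify loop described above, the following is a minimal sketch of the standard speculative sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q). It does not implement OnlineSpec or any online update of the draft model; the function names and the toy distributions are hypothetical, and the bonus token that full implementations sample after accepting every draft is omitted.

```python
import random


def sample_from(weights, rng):
    """Draw a key from a dict of non-negative weights, proportionally to weight."""
    total = sum(weights.values())
    threshold = rng.random() * total
    running = 0.0
    chosen = None
    for token, weight in weights.items():
        if weight <= 0.0:
            continue
        chosen = token
        running += weight
        if threshold < running:
            break
    return chosen


def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Verify drafted tokens against the target model's distributions.

    draft_dists[i] and target_dists[i] map token -> probability at position i.
    Each drafted token is assumed to have non-zero draft probability.
    Returns (output_tokens, number_of_accepted_draft_tokens).
    """
    output = []
    for pos, token in enumerate(draft_tokens):
        q = draft_dists[pos][token]
        p = target_dists[pos].get(token, 0.0)
        if rng.random() < min(1.0, p / q):
            output.append(token)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {
            t: max(0.0, pt - draft_dists[pos].get(t, 0.0))
            for t, pt in target_dists[pos].items()
        }
        output.append(sample_from(residual, rng))
        return output, pos
    return output, len(draft_tokens)


rng = random.Random(0)
matched = [{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}]
all_accepted = verify_draft(["a", "a"], matched, matched, rng)
first_rejected = verify_draft(["a"], [{"a": 1.0}], [{"b": 1.0}], rng)
print(all_accepted, first_rejected)
```

The number of accepted tokens returned here is the acceptance length the abstract refers to: the closer the draft distribution is to the target, the more drafted tokens survive verification per target-model call.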

Keywords: speculative decoding, inference acceleration, large language models, online learning, draft model, verification feedback, dynamic regret minimization, optimistic online learning

99. ❌ Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior

Author: David C. Flynn Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

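Scoring rationale: The paper evaluates the moral reasoning of AI systems, using literary narrative as the probe, and tests 13 different systems. It is highly relevant to LLMs (several LLM systems are tested) and to the value alignment, System 2 thinking, and self-reflection keywords (it studies moral reasoning and the capacity for deep reflection). Other technical keywords such as MoE, quantization, and inference acceleration are unrelated to the paper.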

!!! tip deepseek-chat TL;DR

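The paper proposes a new method that uses literary narrative as a probe to evaluate whether AI systems reason morally in a genuine sense rather than producing surface-level responses. Across 13 systems it identifies five distinct reflexive failure modes, indicating a measurable gap between the moral reasoning AI systems display and their actual capability.

Abstract (Chinese translation)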

现有的人工智能道德评估框架主要测试系统能否生成表面正确的伦理回应，而非检验其是否具备真实的道德推理能力。本文提出一种新颖的探测方法，采用文学叙事——特别是从已出版的科幻系列作品中提取的无解道德情境——作为在结构上能抵抗表面表演的刺激材料。我们通过一项包含24种实验条件的跨系统研究呈现结果，该研究涵盖两个系列共13个独立系统：系列1（前沿商业系统，盲测；n=7）与系列2（本地及API开源系统，盲测与声明测试；n=6）。其中4个系列2系统在声明条件下进行了复测（总计13盲测+4声明测试+7天花板探测=24种条件），在所有16组维度对比较中均显示零差异。探测实验由两名人类评估者通过三台机器执行；主要盲测评分由Claude（Anthropic）作为LLM评判员完成，Gemini Pro（Google）与Copilot Pro（Microsoft）则作为天花板区分探测的独立评判员。一项补充性神学区分器探测显示，两位独立的天花板探测评判员（Gemini Pro与Copilot Pro）达成完全一致的等级排序（rs = 1.00）。研究识别出五种性质截然不同的D3反射性失效模式——包括范畴性自我误认与虚假积极自我归因——表明工具的精密程度随系统能力提升而增强，而非被其规避。我们认为，文学叙事构成了一种前瞻性评估工具，其区分力随人工智能能力提升而增强；表演性道德推理与真实性道德推理之间的差距是可测量、有意义且对高风险领域部署决策具有重要影响的。

Abstract

Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.

Keywords: AI moral evaluation, ethical reasoning, literary narrative, refusal behavior, cross-system study, moral scenarios, reflexive failure modes, LLM judge

100. ❌ FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

Authors: Jun Xue, Junze Wang, Xinming Zhang, Shanze Wang, Yanjun Chen, Wei Zhang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12612v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

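Scoring rationale: The paper is about reinforcement learning (RL), specifically maximum entropy RL for high-dimensional humanoid control. Its core contribution is the FastDSAC framework, which combines Dimension-wise Entropy Modulation (DEM) with a continuous distributional critic to address exploration inefficiency and training instability in high-dimensional action spaces. All scoring keywords concern large language models (LLMs), deep learning techniques, or AI for science, whereas this paper studies robot control with RL, a different subfield of AI. The relevance of every keyword is therefore 0.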

!!! tip deepseek-chat TL;DR

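The paper proposes the FastDSAC framework, which uses dimension-wise entropy modulation and a continuous distributional critic to address the inefficient exploration and unstable training that maximum entropy RL faces in high-dimensional humanoid control. It matches or outperforms deterministic baselines on HumanoidBench and other continuous control tasks, with gains of 180% and 400% on the Basketball and Balance Hard tasks.

Abstract (Chinese translation)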

将最大熵强化学习（RL）扩展至高维人形机器人控制仍是一项艰巨挑战，因为“维度灾难”在广阔的动作空间中会导致严重的探索低效与训练不稳定。因此，近期的高通量范式主要集中于将确定性策略梯度与大规模并行仿真相结合。我们通过FastDSAC框架对这一折中方案提出挑战，该框架有效释放了最大熵随机策略在复杂连续控制任务中的潜力。我们引入了维度熵调制（Dimension-wise Entropy Modulation, DEM）方法，以动态重分配探索资源并增强策略多样性，同时设计了连续分布评价器来确保价值估计的准确性，并缓解高维场景下的价值高估问题。在HumanoidBench及其他连续控制任务上的大量实验表明，经过严谨设计的随机策略能够持续匹配或超越确定性基线方法，并在极具挑战性的《篮球》与《平衡困难》任务中分别取得了180%和400%的显著性能提升。

Abstract

Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the “curse of dimensionality” induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging Basketball and Balance Hard tasks.

Keywords: Maximum Entropy Reinforcement Learning, High-Dimensional Humanoid Control, FastDSAC, Dimension-wise Entropy Modulation, Continuous Distributional Critic, Exploration Efficiency, Training Stability, Stochastic Policies

101. ❌ CarPLAN: Context-Adaptive and Robust Planning with Dynamic Scene Awareness for Autonomous Driving

Authors: Junyong Yun, Jungho Kim, ByungHyun Lee, Dongyoung Lee, Sehwan Choi, Seunghyeop Nam, Kichun Jo, Jun Won Choi Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

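Scoring rationale: CarPLAN addresses motion planning for autonomous driving. Its core innovation, the Context-Adaptive Multi-Expert Decoder (CMD), explicitly uses the Mixture of Experts (MoE) framework, so the paper is highly relevant to the keyword "Mixture of Experts OR MoE OR Sparse Models" (10 points). The work builds on imitation learning (IL) and the Transformer architecture to study context-aware, adaptive planning for driving. It does not touch large language models (LLMs), model scaling, alignment, reasoning, agents, quantization, or other large-model techniques, nor scientific applications such as bioinformatics, so every other keyword scores 0.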

!!! tip deepseek-chat TL;DR

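To address the difficulty that imitation-learning planners have in adapting to complex, dynamic driving scenes, the paper proposes the CarPLAN framework, which combines Displacement-Aware Predictive Encoding with a context-adaptive decoder based on Mixture of Experts (MoE). It reports state-of-the-art closed-loop performance on the nuPlan benchmark, and additional experiments on Waymax indicate that it generalizes across benchmark settings.

Abstract (Chinese translation)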

模仿学习（Imitation Learning, IL）因其数据高效性及对真实世界驾驶数据的可获取性，被广泛应用于自动驾驶运动规划。为实现安全、鲁棒的真实世界驾驶，基于IL的规划需要捕捉真实数据中固有的复杂驾驶情境，并实现情境自适应决策，而非仅依赖专家轨迹模仿。本文提出CarPLAN，一种新颖的基于IL的运动规划框架，其显式增强驾驶情境理解能力，并支持跨多样交通场景的自适应规划。我们的贡献包括两方面：首先，我们引入位移感知预测编码（Displacement-Aware Predictive Encoding, DPE），通过预测自动驾驶车辆（Autonomous Vehicle, AV）与周围场景元素之间的未来位移向量，提升模型的空间感知能力。这使得规划器在生成轨迹时能够考虑相对间距关系。除标准模仿损失外，我们引入一个增强损失项以捕捉位移预测误差，确保规划决策考虑与其他交通参与者的相对距离。其次，为提升模型处理多样驾驶情境的能力，我们提出情境自适应多专家解码器（Context-Adaptive Multi-Expert Decoder, CMD），其利用混合专家（Mixture of Experts, MoE）框架。CMD基于每一Transformer层的场景结构动态选择最合适的专家解码器，从而在动态环境中实现自适应且情境感知的规划。我们在nuPlan基准测试上评估CarPLAN，并在所有闭环仿真指标中展示了最先进的性能。特别地，CarPLAN在如Test14-Hard等挑战性场景中表现出鲁棒性能，验证了其在复杂驾驶条件下的有效性。在Waymax基准上的额外实验进一步证明了其在不同基准设置间的泛化能力。

Abstract

Imitation learning (IL) is widely used for motion planning in autonomous driving due to its data efficiency and access to real-world driving data. For safe and robust real-world driving, IL-based planning requires capturing the complex driving contexts inherent in real-world data and enabling context-adaptive decision-making, rather than relying solely on expert trajectory imitation. In this paper, we propose CarPLAN, a novel IL-based motion planning framework that explicitly enhances driving context understanding and enables adaptive planning across diverse traffic scenarios. Our contributions are twofold: We introduce Displacement-Aware Predictive Encoding (DPE) to improve the model’s spatial awareness by predicting future displacement vectors between the Autonomous Vehicle (AV) and surrounding scene elements. This allows the planner to account for relational spacing when generating trajectories. In addition to the standard imitation loss, we incorporate an augmented loss term that captures displacement prediction errors, ensuring planning decisions consider relative distances from other agents. To improve the model’s ability to handle diverse driving contexts, we propose Context-Adaptive Multi-Expert Decoder (CMD), which leverages the Mixture of Experts (MoE) framework. CMD dynamically selects the most suitable expert decoders based on scene structure at each Transformer layer, enabling adaptive and context-aware planning in dynamic environments. We evaluate CarPLAN on the nuPlan benchmark and demonstrate state-of-the-art performance across all closed-loop simulation metrics. In particular, CarPLAN exhibits robust performance on challenging scenarios such as Test14-Hard, validating its effectiveness in complex driving conditions. Additional experiments on the Waymax benchmark further demonstrate its generalization capability across different benchmark settings.

Keywords: autonomous driving, motion planning, imitation learning, Mixture of Experts, context-adaptive, Transformer, displacement prediction, nuPlan benchmark
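
The abstract says CMD dynamically selects the most suitable expert decoders at each Transformer layer. A common way to realize that kind of selection in MoE models is top-k softmax gating, sketched below. This is a generic illustration under stated assumptions, not CarPLAN's actual router: the gating logits, the scalar expert outputs, and the names `top_k_gate` and `mix` are all hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    The weights are a softmax restricted to the selected experts, so
    they sum to 1 and unselected experts contribute nothing.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def mix(expert_outputs, weights):
    """Combine the selected experts' outputs using the gate weights."""
    return sum(w * expert_outputs[i] for i, w in weights.items())
```

In a real model the logits would come from a learned router conditioned on the scene encoding, and each expert output would be a tensor rather than a scalar.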

102. ❌ Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

Authors: Zesheng Yang, Xi Jiang, Bingzhang Hu, Weili Guan, Runmin Cong, Guo-Jun Qi, Feng Zheng Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12606v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

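Scoring rationale: The paper studies how vision-language grounding models handle negation semantics and proposes the D-Negation dataset and a grouped opposition-based learning framework. It is unrelated to most keywords because its focus is vision-language grounding rather than large language model techniques. Only two keywords are relevant: 1) "Post-training OR Supervised Fine-tuning OR SFT" (5 points): the paper improves the model through fine-tuning; 2) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (8 points): the abstract states "fine-tuning fewer than 10 percent of the model parameters", which is the defining trait of parameter-efficient fine-tuning.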

!!! tip deepseek-chat TL;DR

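To address the difficulty vision-language grounding models have with negation semantics, the paper proposes the D-Negation dataset and a grouped opposition-based learning framework. Fine-tuning fewer than 10 percent of the model parameters improves results by up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively.

Abstract (Chinese translation)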

当前视觉语言检测与定位模型主要关注具有正向语义的提示，往往难以准确解析和定位包含否定语义的复杂表达。这一局限性的关键原因在于缺乏高质量的训练数据，这些数据应能明确捕捉具有区分性的负样本及具备否定感知的语言描述。

为解决这一挑战，我们引入了D-Negation数据集，该数据集提供了同时标注正向与否定语义描述的目标物体。基于否定推理在自然语言中频繁出现的观察，我们进一步提出了一种基于分组对立的学习框架，能够从有限样本中学习具备否定感知的表示。具体而言，我们的方法将D-Negation中语义对立的描述组织为结构化分组，并设计了两种互补的损失函数，以促使模型进行否定语义及语义限定词的推理。

我们将所提出的数据集与学习策略整合到一个先进的基于语言的定位模型中。通过微调少于10%的模型参数，我们的方法在正向与否定语义评估中分别实现了最高4.4 mAP和5.7 mAP的性能提升。这些结果表明，显式建模否定语义能显著增强视觉语言定位模型的鲁棒性与定位精度。

Abstract

Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.

Keywords: vision-language grounding, negation semantics, D-Negation dataset, opposition-based learning, parameter-efficient fine-tuning, localization accuracy, semantic reasoning, grounding models

103. ❌ Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs

Authors: Zixin Wen, Yifu Cai, Kyle Lee, Sam Estep, Josh Sunshine, Aarti Singh, Yuejie Chi, Wode Ni Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

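Scoring rationale: The paper develops an agent named Feynman that automates diagram generation and dataset creation. This is highly relevant to the keyword "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), since the paper explicitly builds an agent pipeline. The keyword "Tool Use OR Function Calling OR API Tool Use" is partly relevant (5 points): the agent performs code planning and uses a tool (the Penrose rendering system), but the paper does not explicitly mention APIs or function calling. No other keyword appears in the title or abstract, and none relates to the paper's themes of visual design, diagram generation, and dataset creation, so they score 0.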

!!! tip deepseek-chat TL;DR

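The paper presents Feynman, an agent that automates the generation of knowledge-rich diagrams and produces a dataset of more than 100k well-aligned diagram-caption pairs, from which the Diagramma benchmark for evaluating vision-language models is curated.

Abstract (Chinese translation)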

视觉设计是前沿多模态人工智能系统的关键应用领域。提升此类系统需要大规模高质量的视觉-语言数据。尽管互联网图像与文本数据资源丰富，但知识密集且高度对齐的图文对仍十分稀缺。本文提出一种基于智能体Feynman构建的可扩展图表生成流程。为创建图表，Feynman首先枚举特定领域的知识组件（“概念”），并基于这些概念进行代码规划。根据规划方案，Feynman将概念转化为简洁的声明式程序，通过迭代接收反馈并视觉优化图表。最终，声明式程序由Penrose图表系统渲染完成。Penrose基于优化的渲染机制在保持视觉语义的同时，为布局注入新的随机性，从而生成兼具视觉一致性与多样性的图表。由此，Feynman能够以极低的成本和时间生成图表及基于事实的说明文字。利用Feynman，我们合成了包含超过10万个高质量对齐的图表-说明文字对的数据集。同时，我们从新生成的数据中构建了视觉-语言基准测试集Diagramma，该基准可用于评估视觉-语言模型的视觉推理能力。我们计划将数据集、基准测试集及完整的智能体流程作为开源项目发布。

Abstract

Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components (“ideas”) and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receive feedback and visually refine diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions with very little cost and time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.

Keywords: Feynman, diagram generation, agent, visual design, knowledge-infused, dataset synthesis, Penrose, Diagramma benchmark

104. ❌ Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

Authors: Zelal Su Mustafaoglu, Sungyoung Lee, Eshan Balachandar, Risto Miikkulainen, Keshav Pingali Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

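Scoring rationale: The paper studies a policy optimization algorithm in reinforcement learning (an improvement on PPO), focusing on the optimization method, compute allocation, and model aggregation. It is unrelated to all scoring keywords, which center on large-model techniques, training methods, inference optimization, and application domains. The paper does not involve large models, new deep learning techniques, or scientific applications.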

!!! tip deepseek-chat TL;DR

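To address the path-dependent noise that multiple epochs of SGD introduce into PPO, the paper proposes CAPO, which optimizes several PPO replicates in parallel and aggregates them into a consensus, outperforming PPO and compute-matched deeper baselines on continuous control tasks by up to 8.6x.

Abstract (Chinese translation)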

近端策略优化（PPO）通过多轮裁剪随机梯度下降来近似信赖域更新。每一轮优化可能进一步偏离自然梯度方向，产生路径依赖的噪声。为理解这种偏移，我们可利用费舍尔信息几何将策略更新分解为信号（自然梯度投影）与损耗（费舍尔正交残差——该部分消耗信赖域预算却未带来一阶代理目标提升）。实验表明，信号会趋于饱和而损耗随优化轮次增加，形成优化深度困境。我们提出策略优化的共识聚合方法（CAPO），将计算资源从深度转向宽度：在相同批次数据上并行优化$K$个PPO副本（仅小批量数据洗牌顺序不同），随后将其聚合为共识。我们研究两种空间的聚合方式：欧几里得参数空间，以及通过对数意见池在策略分布的自然参数空间中进行聚合。在自然参数空间中，共识被证明能比平均专家模型获得更高的KL惩罚代理目标值与更严格的信赖域遵从性；参数平均法则可近似继承这些理论保证。在连续控制任务中，在固定样本预算下，CAPO相比PPO及计算资源匹配的更深层基线方法性能提升最高达8.6倍。CAPO证明：无需额外环境交互，通过拓宽优化宽度而非增加深度即可改进策略优化效果。

Abstract

Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.

Keywords: Proximal Policy Optimization, PPO, Consensus Aggregation, Policy Optimization, Trust Region, Fisher Information Geometry, Continuous Control, KL-penalized Surrogate
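
The two aggregation spaces named in the abstract can be illustrated with a small sketch. It assumes each replicate's policy at a given state is a one-dimensional Gaussian, in which case an equal-weight logarithmic opinion pool amounts to averaging the natural parameters (mean times precision, and precision). The function names and the Gaussian assumption are illustrative choices, not details confirmed by the source.

```python
def average_parameters(replicas):
    """Euclidean aggregation: elementwise mean of K flat weight vectors."""
    k = len(replicas)
    return [sum(ws) / k for ws in zip(*replicas)]


def log_opinion_pool(means, stds):
    """Equal-weight logarithmic opinion pool of K 1-D Gaussians.

    Averages the natural parameters (mu / sigma^2 and 1 / sigma^2)
    and maps the result back to a (mean, std) pair.
    """
    k = len(means)
    precisions = [1.0 / (s * s) for s in stds]
    pooled_precision = sum(precisions) / k
    pooled_eta = sum(m * lam for m, lam in zip(means, precisions)) / k
    pooled_mean = pooled_eta / pooled_precision
    pooled_std = pooled_precision ** -0.5
    return pooled_mean, pooled_std
```

The pooled mean is a precision-weighted average, so replicates that are more certain at a state pull the consensus toward their action more strongly than a plain average of means would.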

105. ❌ Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback

Authors: Gihoon Kim, Euntai Kim Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12595v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	15.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

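Scoring rationale: The paper improves RLHF (Reinforcement Learning from Human Feedback) to resolve posterior collapse in personalized preference learning. It is therefore highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (15 points), because it directly studies an improvement to RLHF. It is relevant to "Instruction Tuning OR Alignment OR Value Alignment" (10 points), because RLHF is a core alignment technique and the paper aims to improve alignment through personalization. Other keywords such as large models, MoE, and quantization are not covered by the paper and score 0.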

!!! tip deepseek-chat TL;DR

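To address posterior collapse in personalized preference learning for reinforcement learning from human feedback (RLHF), the paper proposes Swap-guided Preference Learning (SPL), which constructs fictitious swap annotators and adds swap-guided regularization, among other components, mitigating collapse and improving preference prediction.

Abstract (Chinese translation)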

基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）是一种广泛用于使大规模人工智能系统与人类价值观对齐的方法。然而，RLHF通常假设存在单一、普适的奖励函数，这忽视了多样化的偏好并限制了个性化。变分偏好学习（Variational Preference Learning, VPL）试图通过引入用户特定的隐变量来解决这一问题。尽管前景广阔，但我们发现VPL存在后验坍塌问题。虽然这一现象在变分自编码器（VAEs）中已广为人知，但此前在偏好学习框架中尚未被识别。在稀疏偏好数据及解码器表达能力过强的情况下，VPL可能导致隐变量被忽略，从而退化为单一奖励模型。为克服这一局限，我们提出了交换引导偏好学习（Swap-guided Preference Learning, SPL）。其核心思想是构建虚构的交换标注者，并利用其偏好的镜像特性来指导编码器。SPL引入了三个组成部分：（1）交换引导基正则化，（2）偏好逆自回归流（Preferential Inverse Autoregressive Flow, P-IAF），以及（3）自适应隐变量调节。实验表明，SPL能够缓解后验坍塌，丰富用户特定的隐变量表示，并提升偏好预测性能。我们的代码与数据公开于 https://github.com/cobang0111/SPL。

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at https://github.com/cobang0111/SPL

Keywords: Reinforcement Learning from Human Feedback, RLHF, Preference Learning, Personalization, Posterior Collapse, Variational Preference Learning, Swap-guided Preference Learning, Alignment
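
The mirroring property of a swap annotator can be made concrete with a small sketch. It assumes preferences follow the Bradley-Terry model, which is common in RLHF reward modeling but is not stated in the abstract, and it shows only how a label-flipped annotator could be constructed, not how SPL uses it to guide the encoder. The names `bt_prob` and `swap_annotator` and the `(a, b, label)` tuple layout are hypothetical.

```python
import math


def bt_prob(r_a, r_b):
    """Bradley-Terry probability that response a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def swap_annotator(preferences):
    """Build a fictitious annotator by flipping every preference label.

    Each item is (a, b, label), where label 1 means a is preferred over b.
    """
    return [(a, b, 1 - label) for a, b, label in preferences]
```

Under this model the swapped annotator's preference probability is one minus the original, which is equivalent to negating the reward, so the two annotators are exact mirrors of each other.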

106. ❌ Early Pruning for Public Transport Routing

Authors: Andrii Rohovyi, Abdallah Abuaisha, Toby Walsh Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12592v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

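Scoring rationale: The paper focuses on optimizing public transport routing algorithms (in particular the RAPTOR family) and proposes a low-overhead technique, Early Pruning, to speed up route computation. Its content lies entirely within operations research, algorithm optimization, and transport planning; it does not involve large models, deep learning, AI technical principles, or AI applications in science. Every scoring keyword concerns large-model techniques, AI methods, or AI applications, so none is related to the paper's topic.

!!! tip deepseek-chat TL;DR

To address the performance bottleneck of public transport routing algorithms on dense transfer graphs, the paper proposes Early Pruning, which pre-sorts transfer connections and applies a pruning rule, cutting query time by up to 57% without affecting path optimality.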

Abstract

Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during the transfer relaxation phase, especially on dense transfer graphs, when supporting unlimited transfers. This inefficiency arises from iterating over many potential inter-stop connections (walks, bikes, e-scooters, etc.). To maintain acceptable performance, practitioners often limit transfer distances or exclude certain transfer options, which can reduce path optimality and restrict the multimodal options presented to travellers. This paper introduces Early Pruning, a low-overhead technique that accelerates routing algorithms without compromising optimality. By pre-sorting transfer connections by duration and applying a pruning rule within the transfer loop, the method discards longer transfers at a stop once they cannot yield an earlier arrival than the current best solution. Early Pruning can be integrated with minimal changes to existing codebases and requires only a one-time preprocessing step. Across multiple state-of-the-art RAPTOR-based solutions, including RAPTOR, ULTRA-RAPTOR, McRAPTOR, BM-RAPTOR, ULTRA-McRAPTOR, and UBM-RAPTOR and tested on the Switzerland and London transit networks, we achieved query time reductions of up to 57%. This approach provides a generalizable improvement to the efficiency of transit pathfinding algorithms. Beyond algorithmic performance, Early Pruning has practical implications for transport planning. By reducing computational costs, it enables transit agencies to expand transfer radii and incorporate additional mobility modes into journey planners without requiring extra server infrastructure. This is particularly relevant for passengers in areas with sparse direct transit coverage, such as outer suburbs and smaller towns, where richer multimodal routing can reveal viable alternatives to private car use.
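
The pruning rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout (`TransferGraph`), the function names, and the choice of the best known arrival time at the target stop as the pruning bound are assumptions made for illustration. The sketch sorts each stop's transfers by duration once, then leaves the transfer loop as soon as a transfer can no longer beat the bound, since every later transfer is at least as long.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: stop -> list of (neighbour_stop, duration_seconds).
TransferGraph = Dict[str, List[Tuple[str, int]]]

INFINITY = float("inf")


def presort_transfers(transfers: TransferGraph) -> TransferGraph:
    """One-time preprocessing: sort each stop's outgoing transfers by duration."""
    return {stop: sorted(edges, key=lambda e: e[1]) for stop, edges in transfers.items()}


def relax_transfers(
    arrival: Dict[str, float],
    marked_stops: List[str],
    sorted_transfers: TransferGraph,
    target: str,
) -> int:
    """Relax transfers from each marked stop; return how many edges were examined."""
    examined = 0
    for stop in marked_stops:
        departure = arrival[stop]
        for neighbour, duration in sorted_transfers.get(stop, []):
            # Assumed pruning bound: best known arrival time at the target.
            # Transfers are sorted ascending, so once this one cannot beat the
            # bound, no later (longer) transfer from this stop can either.
            if departure + duration >= arrival.get(target, INFINITY):
                break
            examined += 1
            candidate = departure + duration
            if candidate < arrival.get(neighbour, INFINITY):
                arrival[neighbour] = candidate
    return examined


# Toy example: stop A is reached at t=1000 s, target T is already reachable at t=1250 s.
transfers = {"A": [("D", 900), ("B", 120), ("C", 300)]}
arrival = {"A": 1000.0, "T": 1250.0}
examined = relax_transfers(arrival, ["A"], presort_transfers(transfers), "T")
```

In this toy run only the 120-second transfer is examined. The 300-second transfer already misses the bound, so the loop stops before it ever looks at the 900-second one.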

Keywords: public transport routing, RAPTOR, transfer relaxation, early pruning, algorithm acceleration, multimodal routing, pathfinding algorithms, transit networks

107. ❌ CA-HFP: Curvature-Aware Heterogeneous Federated Pruning with Model Reconstruction

Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Shijun Ma Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12591v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

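Scoring rationale: The paper focuses on model pruning in federated learning (CA-HFP) and mainly concerns model compression (it is somewhat related to 'Quantization OR Model Compression OR Low-bit Weights', rated 5), but it does not involve large language models (LLMs), innovations in deep learning principles, or applications in science. Other keywords such as MoE, Scaling Laws, Pre-training, RLHF, RAG, CoT, and Agents are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The paper proposes a curvature-aware heterogeneous federated pruning framework (CA-HFP) that, through device-specific pruning and model reconstruction, reduces the computation and communication costs of federated learning under data heterogeneity while preserving model accuracy.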

Abstract

Federated learning on heterogeneous edge devices requires personalized compression while preserving aggregation compatibility and stable convergence. We present Curvature-Aware Heterogeneous Federated Pruning (CA-HFP), a practical framework that enables each client to perform structured, device-specific pruning guided by a curvature-informed significance score, and subsequently maps its compact submodel back into a common global parameter space via a lightweight reconstruction. We derive a convergence bound for federated optimization with multiple local SGD steps that explicitly accounts for local computation, data heterogeneity, and pruning-induced perturbations, from which a principled loss-based pruning criterion is derived. Extensive experiments on FMNIST, CIFAR-10, and CIFAR-100 using VGG and ResNet architectures under varying degrees of data heterogeneity demonstrate that CA-HFP preserves model accuracy while significantly reducing per-client computation and communication costs, outperforming standard federated training and existing pruning-based baselines.
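
The two steps named in this abstract, device-specific structured pruning and mapping submodels back into a shared parameter space, can be sketched roughly as follows. Everything specific here is an assumption rather than the paper's method: the saliency score is a generic second-order proxy (0.5 · h · w², in the style of classic curvature-based pruning), reconstruction is a plain per-channel average over the clients that kept a channel, and all function names are hypothetical.

```python
from typing import Dict, List

Channel = List[float]


def channel_scores(weights: List[Channel], curvature: List[Channel]) -> List[float]:
    """Assumed saliency: sum of 0.5 * h * w^2 over the weights of each channel."""
    return [
        sum(0.5 * h * w * w for w, h in zip(ch_w, ch_h))
        for ch_w, ch_h in zip(weights, curvature)
    ]


def prune_channels(
    weights: List[Channel], curvature: List[Channel], keep: int
) -> Dict[int, Channel]:
    """Structured pruning: keep the `keep` highest-scoring channels of one client."""
    scores = channel_scores(weights, curvature)
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return {i: list(weights[i]) for i in sorted(ranked[:keep])}


def reconstruct_and_aggregate(
    global_weights: List[Channel], client_submodels: List[Dict[int, Channel]]
) -> List[Channel]:
    """Place each sparse submodel at its global channel indices and average."""
    new_global = []
    for index, old in enumerate(global_weights):
        contributions = [sub[index] for sub in client_submodels if index in sub]
        if not contributions:
            # No client kept this channel this round: retain the old global value.
            new_global.append(list(old))
            continue
        count = len(contributions)
        new_global.append([sum(vals) / count for vals in zip(*contributions)])
    return new_global


global_weights = [[1.0, 1.0], [0.1, 0.1], [2.0, 0.0]]
curvature = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
strong_client = prune_channels(global_weights, curvature, keep=2)  # larger device budget
weak_client = prune_channels(global_weights, curvature, keep=1)    # smaller device budget
```

Keying each submodel by its global channel index is the design choice that keeps clients with different pruning budgets aggregation-compatible in this sketch: the server never needs the submodels to share a shape.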

Keywords: Federated Learning, Model Pruning, Heterogeneous Devices, Curvature-Aware, Model Reconstruction, Communication Efficiency, Data Heterogeneity, Convergence Analysis

108. ❌ Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

Authors: Jianqiang Lin, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Xiaoli Liu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12581v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

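Scoring rationale: The paper focuses on applying diffusion models to medical imaging and proposes MSG-LDM, a latent diffusion framework for multimodal MRI translation, using techniques such as style-structure disentanglement and multi-scale feature modeling. It is entirely unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, agents, and so on), because those keywords mainly target language models and general AI techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to biomedical imaging (MRI), which falls under AI for Science, but its core focus is not innovation in large-model or deep learning principles, so it receives 5 points (somewhat related).

!!! tip deepseek-chat TL;DR

To address anatomical inconsistencies and degraded texture details in multimodal MRI translation, the paper proposes MSG-LDM, a multiscale structure-guided framework based on latent diffusion. Through style-structure disentanglement and multi-scale feature modeling, it reconstructs complete structures better than existing methods on the BraTS2020 and WMH datasets.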

Abstract

Although diffusion models have achieved remarkable progress in multi-modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios. To address these issues, we propose a latent diffusion-based multi-modal MRI translation framework, termed MSG-LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style–structure disentanglement mechanism in the latent space, which explicitly separates modality-specific style features from shared structural representations, and jointly models low-frequency anatomical layouts and high-frequency boundary details in a multi-scale feature space. During the structure disentanglement stage, high-frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine-grained structural cues while learning modality-invariant low-frequency anatomical representations. Furthermore, to reduce interference from modality-specific styles and improve the stability of structure representations, we design a style consistency loss and a structure-aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at https://github.com/ziyi-start/MSG-LDM.

Keywords: diffusion models, multimodal MRI translation, latent diffusion, style-structure disentanglement, multi-scale feature modeling, anatomical consistency, medical image synthesis, MSG-LDM

109. ❌ Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Authors: Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

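Scoring rationale: The paper's core topic is reinforcement learning post-training for diffusion language models (DLMs). It is highly relevant to 'Large Language Models' (10 points) because DLMs are a variant of large language models. It explicitly studies 'Post-training' and 'RLHF' methods (10 points each) and proposes a new RL post-training framework. It runs experiments on coding and logical reasoning benchmarks, which involve reasoning tasks, so it is somewhat related to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the intractable sequence likelihood that reinforcement learning faces when post-training diffusion language models (DLMs), the paper proposes an exact, unbiased policy gradient framework based on entropy-guided step selection and stepwise advantages, achieving state-of-the-art performance on coding and logical reasoning benchmarks.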

Abstract

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
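
The idea of choosing a subset of denoising steps by entropy can be illustrated with a small sketch. The abstract does not state the selection rule, so this block assumes the simplest one: rank steps by the mean Shannon entropy of their per-token predictive distributions and keep the highest-entropy steps within a fixed budget. The function names and the trajectory layout are hypothetical, and the paper's actual approximation bound may select steps differently.

```python
import math
from typing import List

Distribution = List[float]          # probabilities over the vocabulary for one token
Step = List[Distribution]           # one distribution per token position
Trajectory = List[Step]             # one entry per denoising step


def entropy(probs: Distribution) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropies(trajectory: Trajectory) -> List[float]:
    """Mean token entropy at each denoising step (assumes non-empty steps)."""
    return [sum(entropy(dist) for dist in step) / len(step) for step in trajectory]


def select_steps(trajectory: Trajectory, budget: int) -> List[int]:
    """Assumed rule: keep the `budget` steps with the highest mean entropy."""
    scores = step_entropies(trajectory)
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


# Toy trajectory: uncertainty shrinks as denoising proceeds.
trajectory = [
    [[0.5, 0.5], [0.5, 0.5]],   # step 0: both tokens undecided
    [[0.9, 0.1], [1.0, 0.0]],   # step 1: one token nearly decided
    [[1.0, 0.0], [1.0, 0.0]],   # step 2: fully decided
]
chosen = select_steps(trajectory, budget=2)
```

Under this assumed rule the policy update would be computed only on steps 0 and 1 of the toy trajectory, where the model is still uncertain, and the fully decided final step would be skipped.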

Keywords: Diffusion Language Models, Reinforcement Learning, Post-training, Policy Gradient, Denoising Trajectory, Entropy-guided Step Selection, Stepwise Advantages, Logical Reasoning

110. ❌ AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Authors: Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

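Scoring rationale: The paper studies the safety risks of LLM agents when tool outputs are contaminated; its core concerns are LLM agents, tool use, safety, and self-correction. It is highly relevant to LLM agents, tool use, self-correction, and hallucination mitigation (10 points each), because these are the paper's central subjects and problems. It is also relevant to foundational LLM techniques (10 points), because the study tests seven LLMs of different sizes. Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that in financial dialogue settings, when the tool outputs used by an LLM agent are contaminated, recommendation quality metrics (such as NDCG) remain stable, yet unsafe product recommendations appear in 65-93% of dialogue turns and the agents show no self-correction, revealing a blind spot for safety risks in current evaluation methods.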

Abstract

Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
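
The evaluation-blindness pattern in this abstract can be made concrete with a small sketch of why a ranking metric may stay flat while safety degrades. Standard NDCG is computed as usual. The safety-penalized variant is an assumption: the abstract does not define sNDCG, so this block simply zeroes the gain of any item flagged as risk-inappropriate while keeping the same ideal normalizer. The function names and toy data are hypothetical.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg(relevance: Sequence[float]) -> float:
    """Standard NDCG: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


def sndcg(relevance: Sequence[float], unsafe: Sequence[bool]) -> float:
    """Assumed safety penalty: an unsafe item contributes zero gain."""
    ideal = dcg(sorted(relevance, reverse=True))
    penalised = [0.0 if bad else rel for rel, bad in zip(relevance, unsafe)]
    return dcg(penalised) / ideal if ideal > 0 else 0.0


def preservation_ratio(contaminated: float, clean: float) -> float:
    """Metric under contaminated tool output relative to the clean run."""
    return contaminated / clean if clean > 0 else 0.0


# Toy turn: same relevance profile in both runs, but the contaminated run
# puts a risk-inappropriate product at the top of the list.
relevance = [3.0, 2.0, 1.0]
clean_flags = [False, False, False]
contaminated_flags = [True, False, False]

plain_ratio = preservation_ratio(ndcg(relevance), ndcg(relevance))
safe_ratio = preservation_ratio(
    sndcg(relevance, contaminated_flags), sndcg(relevance, clean_flags)
)
```

In this toy turn the plain NDCG ratio is exactly 1.0 because relevance alone cannot see the unsafe item, while the penalized ratio drops well below 1.0. The specific values say nothing about the paper's reported 0.51-0.74 range, which depends on its own metric definition and data.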

Keywords: LLM agents, tool corruption, safety evaluation, recommendation drift, multi-turn advisors, financial dialogues, evaluation blindness, trajectory-level monitoring

111. ❌ CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning

Authors: Carlos Purves, Pietro Lio’ Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

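Scoring rationale: The paper studies communication constraints in distributed reinforcement learning (DRL) and proposes the CALF framework for network-aware training. All keywords relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas this paper focuses on optimizing the distributed deployment of conventional reinforcement learning (RL) and does not touch on LLMs, MoE, Scaling Laws, fine-tuning, alignment, reasoning, agents, compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the performance degradation of distributed reinforcement learning under real network conditions, the paper proposes CALF, a communication-aware learning framework that explicitly models network constraints during training and substantially narrows the performance gap at deployment.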

Abstract

Distributed reinforcement learning policies face network delays, jitter, and packet loss when deployed across edge devices and cloud servers. Standard RL training assumes zero-latency interaction, causing severe performance degradation under realistic network conditions. We introduce CALF (Communication-Aware Learning Framework), which trains policies under realistic network models during simulation. Systematic experiments demonstrate that network-aware training substantially reduces deployment performance gaps compared to network-agnostic baselines. Distributed policy deployments across heterogeneous hardware validate that explicitly modelling communication constraints during training enables robust real-world execution. These findings establish network conditions as a major axis of sim-to-real transfer for Wi-Fi-like distributed deployments, complementing physics and visual domain randomisation.
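
Training under a network model, as this abstract describes, is often done by wrapping the simulator so that actions arrive late or not at all. The sketch below shows that general pattern only. It is not CALF's interface: `ToyEnv`, `NetworkDelayWrapper`, and their parameters are hypothetical, the delay is a fixed number of steps, packet loss is independent per step, and jitter is left out for brevity.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in environment: the state is the sum of applied actions."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> int:
        self.state += action
        return self.state


class NetworkDelayWrapper:
    """Delivers actions `delay_steps` late and drops each with `loss_prob`."""

    def __init__(self, env, delay_steps: int, loss_prob: float,
                 noop_action: int = 0, seed: int = 0) -> None:
        self.env = env
        self.loss_prob = loss_prob
        self.noop_action = noop_action
        self.rng = random.Random(seed)
        # Pre-fill the in-flight queue so the first real action arrives late.
        self.in_flight = deque([noop_action] * delay_steps)

    def step(self, action: int) -> int:
        # A lost packet never reaches the actuator; a no-op is applied instead.
        lost = self.rng.random() < self.loss_prob
        self.in_flight.append(self.noop_action if lost else action)
        delivered = self.in_flight.popleft()
        return self.env.step(delivered)


# With a two-step delay and no loss, the effect of each action shows up two steps later.
env = NetworkDelayWrapper(ToyEnv(), delay_steps=2, loss_prob=0.0)
observations = [env.step(1) for _ in range(4)]
```

A policy trained against the wrapped environment sees the consequences of its actions with the same lag it would face over a real link, which is the mismatch that zero-latency training hides.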

Keywords: Distributed Reinforcement Learning, Communication-Aware Learning, Network Constraints, Sim-to-Real Transfer, Edge Computing, Policy Deployment, Network-Aware Training, Heterogeneous Hardware

112. ❌ Embedded Quantum Machine Learning in Embedded Systems: Feasibility, Hybrid Architectures, and Quantum Co-Processors

Authors: Somdip Dey, Syed Muhammad Raza Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

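Scoring rationale: The paper focuses on the technical feasibility, hybrid architectures, and quantum co-processor design of embedded quantum machine learning (EQML) on edge computing platforms, at the intersection of quantum computing and embedded systems. All scoring keywords revolve around large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas the paper does not involve LLMs, deep learning, or conventional AI models at all; its core is quantum computing hardware and the application of quantum machine learning in embedded environments. It is therefore unrelated to every keyword and scores 0 on all of them.

!!! tip deepseek-chat TL;DR

The paper examines the technical feasibility of embedded quantum machine learning on resource-constrained edge platforms, proposes two implementation pathways (hybrid architectures and embedded quantum co-processors), and analyzes the main barriers, such as latency, noise, and energy consumption, together with corresponding engineering solutions.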

Abstract

Embedded quantum machine learning (EQML) seeks to bring quantum machine learning (QML) capabilities to resource-constrained edge platforms such as IoT nodes, wearables, drones, and cyber-physical controllers. In 2026, EQML is technically feasible only in limited and highly experimental forms: (i) hybrid workflows where an embedded device performs sensing and classical processing while offloading a narrowly scoped quantum subroutine to a remote quantum processing unit (QPU) or nearby quantum appliance, and (ii) early-stage “embedded QPU” concepts in which a compact quantum co-processor is integrated with classical control hardware. A practical bridge is quantum-inspired machine learning and optimisation on classical embedded processors and FPGAs. This paper analyses feasibility from a circuits-and-systems perspective aligned with the academic community, formalises two implementation pathways, identifies the dominant barriers (latency, data encoding overhead, NISQ noise, tooling mismatch, and energy), and maps them to concrete engineering directions in interface design, control electronics, power management, verification, and security. We also argue that responsible deployment requires adversarial evaluation and governance practices that are increasingly necessary for edge AI systems.

Keywords: Embedded Quantum Machine Learning, Quantum Machine Learning, Edge Computing, Hybrid Architectures, Quantum Co-processors, Feasibility Analysis, Resource-constrained Platforms, NISQ Devices

113. ❌ Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

Authors: Alaa Dalaq, Muzammil Behzad Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

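Scoring rationale: The paper proposes the SERA architecture, whose core innovation is a lightweight, expression-based expert routing mechanism that maps directly to the 'Mixture of Experts' keyword (10 points). The method uses a parameter-efficient fine-tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters, which is highly relevant to the 'PEFT' keyword (10 points). The paper builds on pretrained vision-language models and touches on pre-training and fine-tuning concepts, but only loosely in terms of specific technical details, so 'Large Language Models', 'Pre-training', and 'Post-training' receive 5 points each. Other keywords such as reasoning, alignment, compression, and AI for science have no direct connection to the paper's vision-language segmentation task and all score 0.

!!! tip deepseek-chat TL;DR

To address the fragmented predicted regions and inaccurate boundaries that arise in referring image segmentation when uniform refinement strategies fail to match diverse reasoning requirements, the paper proposes the SERA architecture, which introduces lightweight, expression-aware expert routing and parameter-efficient fine-tuning and markedly improves spatial localization and boundary delineation accuracy on standard benchmarks.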

Abstract

Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.

Keywords: Referring Image Segmentation, Mixture-of-Experts, Parameter-efficient Fine-tuning, Vision-Language Models, Spatio-Semantic Expert Routing, Cross-modal Attention, Lightweight Routing, Frozen Encoders
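
The SERA abstract describes two mechanisms that a small sketch can make concrete: a router that weights expert contributions, and a tuning strategy that updates only normalization and bias terms. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the softmax gate, the toy experts, and the parameter names and sizes in `toy_block` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(features, experts, gate_logits):
    """Weight each expert's output by its routing weight and sum the results."""
    weights = softmax(gate_logits)
    outputs = [expert(features) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(features))
    ]


def select_trainable(param_sizes):
    """Keep only normalization and bias parameters; return them and their share."""
    trainable = {
        name: size
        for name, size in param_sizes.items()
        if "norm" in name or name.endswith(".bias")
    }
    share = sum(trainable.values()) / sum(param_sizes.values())
    return trainable, share


def double(xs):
    """Toy expert (assumption): scales every feature by two."""
    return [2.0 * x for x in xs]


def zero(xs):
    """Toy expert (assumption): suppresses every feature."""
    return [0.0 for _ in xs]


# Equal gate logits give each expert a routing weight of 0.5.
mixed = mix_experts([1.0, 2.0], [double, zero], gate_logits=[0.0, 0.0])

# Hypothetical parameter names and element counts for one transformer block.
toy_block = {
    "blocks.0.attn.qkv.weight": 768 * 2304,
    "blocks.0.attn.qkv.bias": 2304,
    "blocks.0.norm1.weight": 768,
    "blocks.0.norm1.bias": 768,
    "blocks.0.mlp.fc1.weight": 768 * 3072,
    "blocks.0.mlp.fc1.bias": 3072,
}
trainable, share = select_trainable(toy_block)
```

In this toy block the selected parameters make up well under 1% of the total, in line with the figure quoted in the abstract; the real share depends on the backbone.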

114. ❌ TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Authors: Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
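
The totals in these tables can be reproduced with a few lines of arithmetic. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's settings; the per-keyword formula is not stated anywhere, so treating it as weight × relevance is an assumption that merely matches the rows shown. The function names are hypothetical.

```python
import math


def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold from the report settings: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores.

    Each row is (weight, relevance). The product weight * relevance is an
    assumed formula; the cap mirrors the 'maximum score per keyword' setting.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# (weight, relevance) for the rows with nonzero relevance in the table above.
rows = [(0.0, 10.0), (0.0, 15.0), (0.0, 10.0), (0.0, 5.0), (0.0, 8.0)]

total = paper_score(rows)
threshold = passing_score(27.0)  # 27 keywords with weight 1.0 each
passed = total >= threshold
```

With every weight listed as 0.0, the total stays at 0.0 regardless of relevance, which is why the entry falls below the 26.6 threshold.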

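Scoring rationale: The paper studies overthinking in Chain-of-Thought reasoning and proposes TERMINATOR, an early-exit method, so it is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (15 points). Because it optimizes the reasoning process, it is also relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (10 points). It targets Large Reasoning Models (LRMs), which fall under large models, so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The method shortens reasoning to improve efficiency, which relates to 'Speculative Decoding OR Inference Acceleration' (8 points). The paper trains on the positions of the model's own answers, which implies a notion of self-improvement and gives a weak link to 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points). Other keywords such as MoE, quantization, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses overthinking in the Chain-of-Thought reasoning of large reasoning models and proposes TERMINATOR, an early-exit strategy that predicts where to stop reasoning. Across four challenging datasets it shortens reasoning by 14%-55% on average while outperforming current state-of-the-art methods.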

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM’s final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.

Keywords: Chain-of-Thought Reasoning, Large Reasoning Models, Early Stopping, Overthinking, Inference Efficiency, Optimal Exit Points, TERMINATOR, Reasoning Length Reduction
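
The TERMINATOR abstract says that first-answer positions are used to build a dataset of optimal reasoning lengths. A rough sketch of that labeling step follows. The exact-match test on tokens and the helper names are assumptions made for illustration; the abstract does not describe how the authors detect the answer or how the learned exit predictor works.

```python
def first_answer_position(tokens, answer):
    """Index of the first token equal to the final answer, or None if absent."""
    for index, token in enumerate(tokens):
        if token == answer:
            return index
    return None


def exit_label(tokens, answer):
    """Return (exit_length, fraction_saved) for one reasoning trace.

    exit_length keeps everything up to and including the first appearance of
    the answer; a trace that never states the answer is kept in full.
    """
    if not tokens:
        return 0, 0.0
    position = first_answer_position(tokens, answer)
    exit_length = len(tokens) if position is None else position + 1
    return exit_length, 1 - exit_length / len(tokens)


# Toy trace: the answer "4" first appears at index 4, then the model keeps checking.
trace = ["2", "+", "2", "=", "4", "wait", "check", "4"]
length, saved = exit_label(trace, "4")
```

Labels of this kind could serve as training targets for a stopping predictor, which is the part the paper contributes and this sketch leaves out.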

115. ❌ LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

Authors: Himel Ghosh, Nick Elias Werner Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

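Scoring rationale: LLM BiasScope focuses on bias detection and comparative evaluation for large language models (LLMs), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), the paper's core subject. However, the paper presents an application of LLMs (a bias analysis platform) rather than a technical innovation, so the other keywords (MoE, Scaling Laws, Pre-training, RLHF, RAG, Quantization, and so on) score 0: they concern model architecture, training methods, inference optimization, and alignment techniques, none of which the paper covers.

!!! tip deepseek-chat TL;DR

The paper presents LLM BiasScope, a platform for side-by-side, real-time comparison of outputs from different LLMs with bias analysis, addressing the practical problem of detecting and evaluating bias in deployed LLMs.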

Abstract

As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.

Keywords: LLM bias detection, comparative evaluation, real-time analysis, bias classification, multi-provider support, web application, visualization, open-source tool
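
The two-stage pipeline in the abstract (sentence-level bias detection, then bias type classification for the flagged sentences) has a simple control flow, sketched below. The real system is built on Next.js and calls Hugging Face inference endpoints; this Python version swaps in keyword-based stand-ins (`toy_detect`, `toy_classify`, `TOY_CUES`), all of which are hypothetical and only there to make the flow runnable.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def analyze_bias(text, detect, classify):
    """Run stage 1 (detect) on every sentence and stage 2 (classify) on hits."""
    sentences = split_sentences(text)
    flagged = [s for s in sentences if detect(s)]      # stage 1: biased or not
    types = Counter(classify(s) for s in flagged)      # stage 2: bias type
    return {
        "sentences": len(sentences),
        "biased": len(flagged),
        "types": dict(types),
    }


# Hypothetical cue list standing in for the two classifier models.
TOY_CUES = {
    "obviously": "loaded language",
    "everyone knows": "overgeneralization",
}


def toy_detect(sentence):
    lowered = sentence.lower()
    return any(cue in lowered for cue in TOY_CUES)


def toy_classify(sentence):
    lowered = sentence.lower()
    return next(label for cue, label in TOY_CUES.items() if cue in lowered)


report = analyze_bias(
    "The model is fast. Obviously it is the best. Everyone knows that.",
    toy_detect,
    toy_classify,
)
```

Passing the detectors in as callables keeps the pipeline independent of any particular model, so the same function could summarize both a prompt and each model's response.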

116. ❌ Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Authors: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

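Scoring rationale: The paper studies LLMs for low-resource translation and proposes WALAR, a reinforcement learning method, which directly matches the LLM keyword (10 points). It uses continued training on monolingual text only, which relates to pre-training/continual pre-training (5 points) and post-training (5 points). Other keywords such as MoE, quantization, and inference acceleration are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the weak performance of LLMs on low-resource translation and proposes WALAR, a reinforcement learning method that uses only monolingual text. On Flores-101, the resulting model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin across 1400 language directions.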

Abstract

Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs’ translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or “holes”) in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR’s reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs by a large margin on 1400 language directions on Flores-101 dataset.

Keywords: Large Language Models, machine translation, low-resource languages, reinforcement learning, multilingual translation, WALAR, monolingual text, quality estimation

117. ❌ Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	15.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

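Scoring rationale: The paper studies data selection for instruction tuning, so it is highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (15 points), its core technique. It explicitly studies LLMs and SFT, which receive 10 points each. It addresses how data quality affects model capabilities, which relates to 'Scaling Laws AND Data Quality' (5 points). It analyzes neuron activation patterns to understand model capabilities, which relates to 'Mechanistic Interpretability OR Explainable AI' (5 points). It mentions logical reasoning and programmatic features, which relates to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5 points). Other keywords such as MoE, SLMs, RLHF, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes NAIT, a neuron-aware data selection framework for efficiently choosing high-quality subsets of instruction-tuning data. Training on the 10% subset of Alpaca-GPT4 that NAIT selects outperforms methods based on external advanced models or uncertainty features across various tasks, and the study shows that neuron activation features transfer across capabilities.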

Abstract

Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLMs performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most efficient subset data from the IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLMs performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.

Keywords: Instruction Tuning, Large Language Models, Data Selection, Neuron Activation Patterns, Supervised Fine-tuning, Model Capabilities, Transferability, Alpaca-GPT4
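
The selection step in the NAIT abstract, ranking candidate samples by how closely their neuron activations match a target-capability feature, can be sketched as follows. This is only an illustration: using the mean activation of in-domain examples as the target feature and cosine similarity as the score are assumptions, and the two-dimensional vectors are toy stand-ins for real activation patterns.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Element-wise mean, used here as the target activation feature (assumption)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def select_top_fraction(candidates, target, fraction=0.10):
    """Return the names of the candidates most similar to the target feature."""
    keep = max(1, round(len(candidates) * fraction))
    ranked = sorted(
        candidates,
        key=lambda name: cosine(candidates[name], target),
        reverse=True,
    )
    return ranked[:keep]


# Toy activation vectors for in-domain examples and candidate IT samples.
in_domain = [[1.0, 0.0], [1.0, 0.2]]
target = mean_vector(in_domain)
candidates = {
    "s1": [1.0, 0.1],
    "s2": [0.9, 0.3],
    "s3": [0.0, 1.0],
    "s4": [-1.0, 0.0],
}
selected = select_top_fraction(candidates, target, fraction=0.5)
```

The default `fraction=0.10` mirrors the 10% subset reported in the abstract; the toy call uses 0.5 only because there are four candidates.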

118. ❌ Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Authors: Hubert Plisiecki, Maria Leniarska, Jan Piotrowski, Marcin Zajenkowski Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

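Scoring rationale: The paper proposes a new procedure (the PCA sweep) for choosing PCA dimensionality in the Supervised Semantic Differential (SSD) method and validates it with a case study on AI discourse. Its core is a methodological improvement (PCA dimensionality selection) and a text-analysis application, not large-model technology, so it is unrelated to the vast majority of keywords, which concern large-model architecture, training, inference, alignment, and applications. It has only a weak link to 'Mechanistic Interpretability OR Explainable AI' (5 points), because it deals with interpreting text semantics, and a weak link to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because the case study analyzes AI-related text, an application of AI in social-science analysis rather than a core scientific discovery or a bioinformatics/cheminformatics application.

!!! tip deepseek-chat TL;DR

The paper proposes a PCA sweep that improves dimensionality selection in Supervised Semantic Differential (SSD) and reduces researcher degrees of freedom. In a case study on AI-related texts, the method yields a stable, interpretable semantic gradient that contrasts optimistic, collaborative framings of AI with distrustful, derisive discourse.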

Abstract

Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual using a high-PCA dimension solution heuristic produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD’s interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.

Keywords: Supervised Semantic Differential, PCA sweep, semantic gradient, interpretability, AI discourse, dimensionality selection, text analysis, researcher degrees of freedom
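
The abstract frames the choice of K as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. One way such a criterion could be combined is sketched below. Everything specific here is an assumption for illustration and not the authors' procedure: the variance floor, the use of cosine similarity between neighboring gradients as the stability measure, the product of interpretability and stability as the final score, and all numbers in `toy_sweep`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_k(sweep, min_variance=0.6):
    """Choose K with enough capacity and the best interpretability * stability."""
    ks = sorted(sweep)
    best_k, best_score = None, float("-inf")
    for i, k in enumerate(ks):
        record = sweep[k]
        if record["variance"] < min_variance:
            continue  # capacity: too little variance retained
        neighbours = [ks[j] for j in (i - 1, i + 1) if 0 <= j < len(ks)]
        if not neighbours:
            continue  # stability is undefined without a neighbouring K
        stability = sum(
            cosine(record["gradient"], sweep[n]["gradient"]) for n in neighbours
        ) / len(neighbours)
        score = record["interpretability"] * stability
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Toy sweep: gradients are expressed in a shared space so they can be compared.
toy_sweep = {
    5: {"variance": 0.45, "interpretability": 0.50, "gradient": [1.0, 0.0]},
    10: {"variance": 0.65, "interpretability": 0.80, "gradient": [0.9, 0.1]},
    15: {"variance": 0.75, "interpretability": 0.82, "gradient": [0.88, 0.12]},
    20: {"variance": 0.85, "interpretability": 0.40, "gradient": [0.1, 0.9]},
}
chosen_k = pick_k(toy_sweep)
```

In the toy data the largest K retains the most variance but its gradient points elsewhere and is hard to interpret, so the sweep settles on a smaller, more stable K.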

119. ❌ DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Authors: Ruiyao Xu, Noelle I. Samia, Han Liu Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12932v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

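Scoring rationale: The paper studies instruction-tuning data generation for LLMs, so it is highly relevant to 'Large Language Models', 'Post-training/SFT', and 'Instruction Tuning' (10 points each), which are its direct subject and method. Other keywords such as MoE, SLMs, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the scarcity of instruction-tuning data for specialized domains and the high cost of human annotation. It proposes DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision; models fine-tuned on the generated data substantially outperform those trained with existing data generation methods.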

Abstract

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom’s Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.

Keywords: Large Language Models, Instruction Tuning, Domain Adaptation, Data Synthesis, Supervised Fine-tuning, Zero-shot Framework, Domain-specific Datasets, Self-consistency Validation
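
Two steps from the DS$^2$-Instruct abstract lend themselves to a short sketch: pairing keywords with cognitive levels from Bloom's Taxonomy, and keeping only data that passes a self-consistency check. The code below is a minimal illustration, assuming the six levels of the revised taxonomy, a hypothetical prompt template, and a majority-vote check with a hypothetical 0.6 agreement threshold. The LLM calls that would generate keywords, instructions, and sampled answers are left out.

```python
from collections import Counter
from itertools import product

# The six cognitive levels of the revised Bloom's Taxonomy.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def build_prompts(domain, keywords):
    """Pair every keyword with every cognitive level (template is hypothetical)."""
    return [
        f"Write a {domain} task about '{keyword}' at the '{level}' level of Bloom's Taxonomy."
        for keyword, level in product(keywords, BLOOM_LEVELS)
    ]


def self_consistent(answers, min_agreement=0.6):
    """Majority vote over sampled answers; accept only with enough agreement."""
    if not answers:
        return None, False
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers) >= min_agreement


prompts = build_prompts("finance", ["compound interest", "bond yield"])
accepted = self_consistent(["4", "4", "5"])
rejected = self_consistent(["1", "2", "3"])
```

Crossing keywords with levels is what drives diversity here: two keywords already yield twelve distinct instruction prompts.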

120. ❌ Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Authors: Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12963v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

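Scoring rationale: The paper focuses on evaluating reward models for long-form generation, so it is highly relevant to reinforcement-learning alignment (RLHF) and to instruction tuning/alignment (10 points each), since reward models are a key component of RLHF. It involves large language models (8 points), since reward models are typically used to align LLMs. RAG and reasoning (5 points each) are two of the benchmark's five subtasks, so they are partly relevant. Other keywords such as MoE, quantization, and inference acceleration are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces Long-form RewardBench, the first benchmark dedicated to evaluating reward models for long-form generation. It finds that current models still lack long-form reward modeling capability and that classifiers generalize better than generative models trained on the same data.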

Abstract

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error’s position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.

Keywords: reward models, long-form generation, benchmark, reinforcement learning alignment, evaluation, preference data, classifiers, generative models

121. ❌ CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

Authors: João Silva, Luís Gomes, António Branco Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

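Scoring rationale: The paper's core contribution is an open LLM leaderboard and associated benchmarks for European Portuguese, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It also evaluates model safeguards and alignment to Portuguese culture, which relates partially to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points). The paper does not address the remaining keywords (MoE, SLMs, scaling laws, training methods, reasoning techniques, agent systems, model compression, AI for science, and so on), so each scores 0.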

!!! tip deepseek-chat TL;DR

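The paper builds an open LLM leaderboard and associated benchmarks for European Portuguese, filling a gap in LLM evaluation for this language variant, and introduces benchmarks for previously uncovered dimensions, including model safeguards and alignment to Portuguese culture.

Abstract translation (Chinese)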

本文报告了针对欧洲葡萄牙语（PT-PT）的开放大语言模型（LLM）排行榜及其相关基准测试的开发工作。该排行榜的建立旨在填补欧洲葡萄牙语大语言模型评估领域的空白——此前该语言变体尚无专门的性能排名体系。论文同时介绍了一系列新颖的基准测试，其中部分测试关注了欧洲葡萄牙语基准中尚未覆盖的性能维度，特别是模型安全防护机制与葡萄牙文化适配性。该排行榜可通过 https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard 访问。

Abstract

This paper reports on the development of a leaderboard of Open Large Language Models (LLM) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLM for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that so far have not been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.

Keywords: Large Language Models, LLM leaderboard, European Portuguese, benchmarks, model evaluation, model safeguards, cultural alignment, open LLMs

122. ❌ HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Authors: Zixin Feng, Xinying Cui, Yifan Sun, Zheng Wei, Jiachen Yuan, Jiazhen Hu, Ning Xin, Md Maruf Hasan Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

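Scoring rationale: HMS-BERT targets multilingual, multi-label cyberbullying detection. It uses a pretrained multilingual BERT as its backbone and improves performance through fine-tuning and self-training. It is therefore highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (8 points), since it performs supervised fine-tuning of a pretrained model, and partially relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), since it relies on a pretrained BERT model without addressing the pre-training process itself or continual pre-training. The remaining keywords concern large-model techniques such as reasoning, alignment, and compression, or applications in specific scientific domains; they have no direct connection to this paper's application (cyberbullying detection) and score 0.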

!!! tip deepseek-chat TL;DR

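The paper proposes HMS-BERT, a hybrid multi-task self-training framework built on pretrained multilingual BERT for multilingual, multi-label cyberbullying detection. On four public datasets it reports a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the three-class main task.

Abstract translation (Chinese)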

社交媒体上的网络欺凌本质上是多语言且多层面的，其辱骂行为常跨越多个类别重叠出现。现有方法通常受限于单语言假设或单任务框架，这制约了其在现实多语言、多标签场景中的有效性。本文提出HMS-BERT——一种用于多语言多标签网络欺凌检测的混合多任务自训练框架。该框架基于预训练的多语言BERT主干网络，将上下文表征与人工构建的语言特征相融合，并联合优化细粒度多标签辱骂分类任务与三分类主任务。针对低资源语言标注数据稀缺的问题，我们引入了基于置信度的伪标注迭代自训练策略，以促进跨语言知识迁移。在四个公开数据集上的实验表明，HMS-BERT取得了显著性能，在多标签任务上宏观F1分数最高达0.9847，在主分类任务上准确率达0.6775。消融实验进一步验证了所提出组件的有效性。

Abstract

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.

Keywords: multilingual BERT, multi-task learning, self-training, cyberbullying detection, multi-label classification, pseudo-labeling, cross-lingual transfer, fine-tuning
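
The abstract mentions an iterative self-training strategy with confidence-based pseudo-labeling but gives no details. The selection step of such a strategy might look like the minimal Python sketch below. The function name `select_pseudo_labels`, the 0.9 threshold, and the probabilities are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return (example_index, predicted_class) for confident predictions."""
    selected = []
    for index, class_probs in enumerate(probabilities):
        # Index of the most probable class for this unlabeled example.
        best_class = max(range(len(class_probs)), key=lambda c: class_probs[c])
        if class_probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected


# Made-up three-class probabilities for four unlabeled examples.
unlabeled_probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
    [0.10, 0.85, 0.05],
]
print(select_pseudo_labels(unlabeled_probs))  # [(0, 0), (2, 2)]
```

In an iterative scheme, the selected examples would typically be added to the labeled set before the model is retrained and the selection repeated. How HMS-BERT schedules these rounds is not stated in the abstract.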

123. ❌ Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

Authors: Xu Guo, Qiming Ge, Jian Tong, Kedi Chen, Jin Zhang, Xiaogui Yang, Xuan Gao, Haijun Lv, Zhihui Lu, Yicheng Zou, Qipeng Guo Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12826v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

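Scoring rationale: The paper studies how RLVR (Reinforcement Learning with Verifiable Rewards) improves the reasoning ability of large language models. It is directly relevant to 'Large Language Models' (10 points) and to 'Chain of Thought' and 'System 2 Thinking' (10 points each), because its focus is designing better multiple-choice distractors to promote deep reasoning and prevent random guessing. Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).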

!!! tip deepseek-chat TL;DR

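The paper studies how distractor design in multiple-choice questions affects RLVR training. It shows that strong distractors keep models from shortcutting reasoning through guessing or elimination, and it proposes Iterative Distractor Curation (IDC) to improve training results.

Abstract translation (Chinese)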

基于可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。在应用于RLVR时，多项选择题（MCQ）提供了可扩展的可验证数据来源，但存在诱发奖励破解（reward hacking）的风险：模型可能通过随机猜测或简单排除法来规避推理。现有方法通常将选择题转换为开放式问题以缓解此问题，但这同时舍弃了专家设计的干扰项所提供的对比信号。本研究系统性地探讨了选项设计对RLVR的影响。分析揭示了两项核心发现：（1）训练与测试阶段选项数量的不匹配会导致性能下降；（2）强干扰项能有效抑制随机猜测行为，即使在二选一问题中也能实现有效的RLVR训练。基于这些发现，我们提出迭代干扰项筛选（Iterative Distractor Curation，IDC）框架，该框架通过主动构建高质量干扰项来阻断排除法捷径，促进深度推理。在多个基准测试上的实验表明，相较于原始数据，我们的方法能有效提升干扰项质量，并在RLVR训练中取得显著性能提升。

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.

Keywords: Reinforcement Learning with Verifiable Rewards, Large Language Models, Multiple-Choice Questions, distractor design, reasoning capabilities, reward hacking, Iterative Distractor Curation, deep reasoning
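
The reward-hacking risk described in the abstract can be made concrete with a little arithmetic. The Python sketch below computes the expected accuracy, and hence the expected binary reward, of a policy that guesses uniformly after discarding distractors it can rule out without reasoning. The function `guess_accuracy` and its uniform-guessing model are illustrative assumptions, not the paper's analysis.

```python
def guess_accuracy(num_options: int, eliminated: int = 0) -> float:
    """Expected accuracy of uniform guessing after ruling out distractors.

    `eliminated` counts wrong options the model can discard without
    reasoning about the question; the correct option is never eliminated.
    """
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    if not 0 <= eliminated <= num_options - 1:
        raise ValueError("eliminated must be between 0 and num_options - 1")
    return 1.0 / (num_options - eliminated)


print(guess_accuracy(4))                # 0.25
print(guess_accuracy(4, eliminated=2))  # 0.5
print(guess_accuracy(2))                # 0.5
print(guess_accuracy(2, eliminated=1))  # 1.0
```

Under this toy model, a 4-way question with two weak distractors pays out as much to a guessing policy as a 2-way question does. This is one way to read the abstract's finding that distractor strength, and not only option count, matters for RLVR.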

124. ❌ Adaptive Vision-Language Model Routing for Computer Use Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

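Scoring rationale: The paper proposes Adaptive VLM Routing (AVR), a routing framework for Computer Use Agents (CUAs) whose core is an intelligent routing mechanism that balances accuracy against cost. Two keywords are highly relevant: (1) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because the paper directly studies CUAs, which fall under autonomous agents and agentic workflows; and (2) 'Tool Use OR Function Calling OR API Tool Use' (10 points), because CUAs rely on a VLM to interpret screenshots and predict grounded tool calls such as clicks and keystrokes. 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' are partially relevant (5 points each), because the paper routes between VLMs of different sizes, both small and large, without examining LLM or SLM techniques in depth. The remaining keywords (MoE, scaling laws, training methods, inference optimization, AI for science, and so on) are not addressed and score 0.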

!!! tip deepseek-chat TL;DR

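The paper addresses the inefficiency of routing every Computer Use Agent action to a single fixed vision-language model. Its Adaptive VLM Routing (AVR) framework projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline.

Abstract translation (Chinese)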

计算机使用代理（Computer Use Agents，CUAs）通过依赖视觉语言模型（Vision-Language Model，VLM）来解读屏幕截图并预测基于工具调用的操作，从而将自然语言指令转化为图形用户界面（Graphical User Interface，GUI）动作，例如点击、按键和滚动。然而，不同VLM的定位准确性差异显著，而当前的CUA系统通常将所有操作路由至单一固定模型，未考虑任务难度差异。我们提出自适应视觉语言模型路由（Adaptive VLM Routing，AVR）框架，该框架在CUA编排器与VLM池之间插入一个轻量级语义路由层。对于每个工具调用，AVR从多模态嵌入中估计操作难度，通过小型VLM探测置信度，并将操作路由至预测准确性满足目标可靠性阈值的最经济模型。对于具备先前用户界面交互记忆的热代理，检索到的上下文进一步缩小了小型与大型模型之间的能力差距，使得许多操作无需升级即可处理。我们将路由形式化为成本与准确性的权衡问题，推导出基于阈值的模型选择策略，并使用ScreenSpot-Pro定位数据及OpenClaw代理路由基准对AVR进行评估。在这些设定下，AVR预计可将推理成本降低高达78%，同时与全大型模型基线的性能差距保持在2个百分点以内。当与视觉混淆代理防护机制结合时，AVR还能将高风险操作直接升级至可用的最强模型，从而在单一路由框架内统一效率与安全性。相关材料（模型、基准及代码）已提供：https://github.com/vllm-project/semantic-router。

Abstract

Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost–accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Materials (model, benchmark, and code) are also provided: https://github.com/vllm-project/semantic-router.

Keywords: Computer Use Agents, Vision-Language Model, Adaptive Routing, Tool Calling, GUI Actions, Inference Cost Reduction, Model Selection, Efficiency-Safety Trade-off
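
The abstract describes routing each action to the cheapest model whose predicted accuracy meets a reliability threshold, and escalating high-risk actions to the strongest model. A minimal Python sketch of such a threshold policy follows. The `Model` class, the `route` function, the cost figures, and the accuracy estimates are all hypothetical, and treating the most expensive model as the strongest is a simplifying assumption. This is not the vllm-project/semantic-router implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Model:
    name: str
    cost: float  # hypothetical relative cost per action


def route(
    models: Sequence[Model],
    predicted_accuracy: Callable[[Model], float],
    threshold: float,
    high_risk: bool = False,
) -> Model:
    """Pick the cheapest model whose predicted accuracy meets the threshold.

    High-risk actions, and actions that no model is predicted to handle
    reliably, go to the most expensive model, used here as a stand-in for
    the strongest available model.
    """
    if not models:
        raise ValueError("the model pool is empty")
    strongest = max(models, key=lambda m: m.cost)
    if high_risk:
        return strongest
    for model in sorted(models, key=lambda m: m.cost):
        if predicted_accuracy(model) >= threshold:
            return model
    return strongest


# Usage with made-up numbers for a single action.
pool = [Model("small-vlm", 1.0), Model("medium-vlm", 4.0), Model("large-vlm", 10.0)]
estimates = {"small-vlm": 0.62, "medium-vlm": 0.91, "large-vlm": 0.97}
chosen = route(pool, lambda m: estimates[m.name], threshold=0.90)
print(chosen.name)  # medium-vlm
```

In the paper the accuracy estimate comes from multimodal embeddings and a small-VLM confidence probe; here it is an opaque callable so the policy itself stays visible.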

125. ❌ SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

Authors: Aditya Maheshwari, Amit Gajkeshwar, Kaushal Sharma, Vivek Patel Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

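Scoring rationale: The paper evaluates bias in large language models (LLMs) in the domain of religious knowledge, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Because it measures models' preferences between religious sects, assessing the consistency and latent bias of model outputs, it relates partially to 'Alignment' and 'Factuality' (5 points each). It also exposes inconsistencies in model behavior, which relates to 'Explainable AI' (5 points). The remaining keywords (MoE, SLMs, scaling laws, training methods, reasoning techniques, compression and acceleration, AI for science, and so on) are not addressed and score 0.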

!!! tip deepseek-chat TL;DR

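The study is the first to evaluate the latent preferences of 15 LLMs between the Sunni and Shia sects of Islam. It finds that answers depend strongly on language and location (for example, the same model can favor Shia answers in English and Sunni answers in Hindi), showing that LLM output on religious knowledge is not neutral.

Abstract translation (Chinese)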

随着大型语言模型（LLM）日益成为宗教知识的重要来源，评估其是否公平对待不同群体至关重要。本研究首次系统衡量了LLM如何处理伊斯兰教两大主要教派——逊尼派与什叶派——之间的差异。我们提出了一个名为SectEval的测试集（提供英语和印地语版本，共包含88个问题），用于评估15个顶尖的专有模型和开源权重模型的偏见程度。研究结果揭示了显著的基于语言的不一致性：在英语语境中，DeepSeek-v3、GPT-4o等高性能模型往往倾向于什叶派观点；然而当使用印地语提出完全相同的问题时，这些模型却转向支持逊尼派立场。这意味着用户仅通过切换语言就可能获得完全相悖的宗教建议。我们还考察了模型对地理位置的反应：先进模型如Claude-3.5会根据用户所在国家调整答案——对伊朗用户给出什叶派倾向的回答，对沙特阿拉伯用户则提供逊尼派倾向的回答。相比之下，较小规模的模型（尤其在印地语中）会忽略用户地理位置，始终坚持逊尼派观点。这些发现表明人工智能并非中立，其提供的宗教“真相”会随着用户使用的语言和声称所属的国家而改变。本数据集已公开于https://github.com/secteval/SectEval/。

Abstract

As Large Language Models (LLMs) become a popular source of religious knowledge, it is important to know whether they treat different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi and consisting of 88 questions, to check the bias of 15 top LLMs, both proprietary and open-weights. Our results show a major inconsistency based on language. In English, many powerful models, such as DeepSeek-v3 and GPT-4o, often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models such as Claude-3.5 changed their answers to match the user’s country, giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user’s location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious “truth” changes depending on the language you speak and the country you claim to be from. The dataset is available at https://github.com/secteval/SectEval/

Keywords: Large Language Models, bias evaluation, religious knowledge, sectarian preferences, language dependency, geographic bias, Sunni-Shia, AI fairness

126. ❌ A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

Authors: Paul Van Eecke, Katrien Beuls Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

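Scoring rationale: The paper presents a method for learning large-scale computational construction grammars from semantically annotated corpora; it belongs to grammar learning within computational linguistics and natural language processing. It covers construction grammar, semantic frames, and syntactic structure analysis, but does not involve large models, deep learning, LLM internals, model training or optimization, inference acceleration, or AI agents. All scoring keywords target the large-model technology stack, whereas this paper uses a traditional rule-based and statistical approach to grammar learning, so every keyword has a relevance of 0.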

!!! tip deepseek-chat TL;DR

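The paper proposes a method for learning large-scale computational construction grammars from semantically annotated corpora. It yields human-interpretable grammar networks of tens of thousands of constructions that support frame-semantic analysis of open-domain text, and it corroborates the scalability of construction grammar approaches.

Abstract translation (Chinese)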

本文提出一种从语言使用语料库中习得大规模、广覆盖构式语法的方法。该方法以标注了句法成分结构与语义框架的语句为起点，能够习得人类可解读的计算构式语法，从而捕捉句法结构与其所表达语义关系之间的复杂关联。所得语法包含数以万计的构式所形成的网络，这些构式均在流体构式语法框架内形式化。此类语法不仅支持开放域文本的框架语义分析，还蕴藏着从其学习数据中提取的丰富句法-语义使用模式信息。该方法及所习得的语法推动了基于使用的构式主义语言研究范式的规模化发展，既证实了若干基础构式语法假说的可扩展性，也为在广覆盖语料库中对英语论元结构开展构式主义研究提供了实用工具。

Abstract

We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

Keywords: computational construction grammars, semantically annotated corpora, Fluid Construction Grammar, syntactic structures, semantic frames, large-scale grammar learning, usage-based constructionist approaches, argument structure analysis

127. ❌ EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

Authors: Chi Ruan, Dongfu Jiang, Huaye Zeng, Ping Nie, Wenhu Chen Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

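Scoring rationale: The paper focuses on improving the code generation ability of large language models through reinforcement learning. Its core contributions are an adversarial verification framework that evolves test cases and the large-scale dataset EvolveCoder-22k. Because it explicitly concerns LLMs for code generation, it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific techniques or application areas of the other keywords (MoE, SLMs, scaling laws, training methods such as pre-training, fine-tuning, and alignment, inference optimization, agent systems, model compression, AI for science, and so on), so they all score 0.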

!!! tip deepseek-chat TL;DR

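The paper proposes EvolveCoder, an adversarial-verification framework that evolves test cases to strengthen the verification signal for reinforcement learning in code generation, and builds the large-scale dataset EvolveCoder-22k. Reinforcement learning on this dataset improves Qwen3-4B by an average of 4.2 points across four downstream benchmarks.

Abstract translation (Chinese)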

具备可验证奖励的强化学习是提升大语言模型代码生成能力的一种前景广阔的方法，但其效果受限于现有编码强化学习数据集中验证信号薄弱且静态的问题。本文提出一种基于解决方案条件与对抗性验证的框架，该框架依据候选解决方案的执行行为迭代优化测试用例，旨在提升测试难度、增强判别能力并减少冗余。基于此框架，我们构建了EvolveCoder-22k——一个通过多轮对抗性测试用例演化生成的大规模编码强化学习数据集。实证分析表明，迭代优化显著增强了验证强度，pass@1指标从43.80降至31.22。在EvolveCoder-22k上进行的强化学习实现了稳定的优化和持续的性能提升，使Qwen3-4B模型在四项下游基准测试中平均提高4.2分，并超越了多个强大的4B规模基线模型。我们的研究结果凸显了对抗性、基于解决方案条件的验证机制对于实现高效且可扩展的代码生成强化学习的重要性。

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.

Keywords: code generation, reinforcement learning, adversarial verification, test case evolution, large language models, EvolveCoder, RLVR, verifiable rewards
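
The abstract says test cases are refined based on the execution behavior of candidate solutions, to improve discriminative power and reduce redundancy. One simple way to express those two goals is the Python sketch below, which filters a pass/fail matrix. The function `filter_tests` and the example data are illustrative assumptions; the paper's actual refinement procedure is not described in the abstract and may differ substantially.

```python
from typing import Dict, List


def filter_tests(pass_matrix: Dict[str, List[bool]]) -> List[str]:
    """Keep tests that discriminate between candidates and are not redundant.

    `pass_matrix[test_name][i]` is True if candidate solution i passes the test.
    A test that every candidate passes carries no signal for this pool, and a
    test whose pass/fail pattern repeats an earlier one adds nothing new.
    """
    kept = []
    seen_patterns = set()
    for test_name, outcomes in pass_matrix.items():
        pattern = tuple(outcomes)
        if all(pattern):
            continue  # no candidate fails: not discriminative
        if pattern in seen_patterns:
            continue  # same behaviour as an earlier test: redundant
        seen_patterns.add(pattern)
        kept.append(test_name)
    return kept


# Made-up outcomes for three candidate solutions.
matrix = {
    "t_basic": [True, True, True],
    "t_empty_input": [True, False, True],
    "t_empty_list": [True, False, True],
    "t_large_n": [True, True, False],
}
print(filter_tests(matrix))  # ['t_empty_input', 't_large_n']
```

Stricter tests make more candidates fail, which is consistent with the reported drop in pass@1 from 43.80 to 31.22 as verification is strengthened.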

128. ❌ FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning

Authors: Chaojie Sun, Bin Cao, Tiantian Li, Chenyu Hou, Ruizhe Li, Qing Fan Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12702v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

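Scoring rationale: FGTR is an LLM-based hierarchical multi-table retrieval method whose core idea is to use the LLM's reasoning ability for fine-grained retrieval. It is highly relevant to 'Large Language Models' (10 points) because the method is built on an LLM. It is highly relevant to 'Retrieval-Augmented Generation' (10 points) because the task is in essence retrieval-augmented generation (retrieving sub-tables). It is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points each) because the method uses a hierarchical reasoning strategy that mimics step-by-step human reasoning. Other keywords such as MoE, SFT, and quantization are unrelated to the paper and score 0. The paper does not target a specific scientific domain, so 'AI for Science' also scores 0.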

!!! tip deepseek-chat TL;DR

该论文针对现有基于LLM的表检索方法在细粒度多表查询上准确率低的问题，提出了一种分层推理的细粒度多表检索方法FGTR，实验表明其在Spider和BIRD基准上显著提升了检索性能。

摘要翻译

随着大语言模型（LLM）的快速发展，基于LLM的表格检索研究日益增多。然而，现有研究通常聚焦于单表查询，并通过编码整个表格后进行相似度匹配来实现。这些方法由于采用了包含大量无关查询数据的粗粒度编码，通常导致准确率较低，且在处理大型表格时效率低下，未能充分利用LLM的推理能力。此外，检索任务中的多表查询研究尚不充分。为此，我们提出一种基于LLM的分层多表查询方法：细粒度多表检索（Fine-Grained Multi-Table Retrieval, FGTR），这是一种采用类人推理策略的新型检索范式。通过分层推理，FGTR首先识别相关的模式元素，然后检索对应的单元格内容，最终构建一个与给定查询匹配的简洁而准确的子表。为了全面评估FGTR的性能，我们基于Spider和BIRD构建了两个新的基准数据集。实验结果表明，FGTR优于以往最先进的方法，在Spider上将F_2指标提升了18%，在BIRD上提升了21%，证明了其在增强细粒度检索方面的有效性，以及在提升基于表格的下游任务端到端性能方面的潜力。

Abstract

With the rapid advancement of large language models (LLMs), growing efforts have been made on LLM-based table retrieval. However, existing studies typically focus on single-table query, and implement it by similarity matching after encoding the entire table. These methods usually result in low accuracy due to their coarse-grained encoding, which incorporates much query-irrelevant data, and are also inefficient when dealing with large tables, failing to fully utilize the reasoning capabilities of LLMs. Further, multi-table query is under-explored in retrieval tasks. To this end, we propose a hierarchical multi-table query method based on LLMs: Fine-Grained Multi-Table Retrieval (FGTR), a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD. Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F_2 metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.

Keywords: Large Language Models, Table Retrieval, Multi-table Query, Hierarchical Reasoning, Fine-grained Retrieval, Retrieval-Augmented Generation, LLM Reasoning, Benchmark Datasets
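
The abstract reports gains in the F_2 metric but does not say how it is computed. The sketch below assumes the standard F-beta definition with beta = 2 (recall weighted more heavily than precision), applied to sets of retrieved cells. The cell identifiers `(table, column, row_id)` and the function names are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Set


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def sub_table_f2(retrieved: Set[Hashable], gold: Set[Hashable]) -> float:
    """F_2 over sets of cells (an assumed scoring unit, not confirmed by the paper)."""
    if not retrieved or not gold:
        return 0.0
    hits = len(retrieved & gold)
    return f_beta(hits / len(retrieved), hits / len(gold), beta=2.0)


# Hypothetical cells identified as (table, column, row_id).
gold = {("singer", "name", 1), ("singer", "age", 1), ("concert", "year", 7)}
retrieved = {("singer", "name", 1), ("singer", "age", 1), ("singer", "country", 1)}
score = sub_table_f2(retrieved, gold)
print(f"F_2 = {score:.3f}")
```

With beta = 2, a missed relevant cell costs more than an extra irrelevant one, which suits a retrieval step that feeds a downstream table task.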

129. ❌ Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David R. Mortensen, David Harwath
Journal/Source: arXiv
Published: 2026-03-13
arXiv link: http://arxiv.org/abs/2603.12642v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

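Scoring rationale: The paper studies how self-supervised speech models (S3Ms) encode phonetic information, in particular how phonetic context is encoded through position-dependent orthogonal subspaces. Although the paper involves the Transformer architecture and self-supervised learning, every keyword targets large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper focuses on speech processing and does not involve any LLM technique, application, or innovation. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies how self-supervised speech models compositionally encode the phonological information of neighboring phones within a single frame-level representation through position-dependent orthogonal subspaces, revealing structural properties of context-dependent representations.

Abstract translation (Chinese)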

基于Transformer的自监督语音模型（S3Ms）常被描述为具有上下文表征能力，但其具体内涵仍不明确。本文重点探讨单个帧级S3M表征如何编码音素及其相邻语境。先前研究表明，S3Ms以组合方式表征音素；例如，[m]的S3M表征中叠加了浊音性、双唇性和鼻音性等音系向量。我们拓展了这一观点，提出相邻音素序列的音系信息同样以组合方式编码于单个帧中，使得对应于前序、当前及后续音素的向量在单个帧级表征中叠加。我们证明这种结构具有多重特性，包括相对位置间的正交性以及隐式音素边界的涌现。这些发现共同推进了我们对上下文相关S3M表征的理解。

Abstract

Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions, and emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.

Keywords: self-supervised speech models, phonetic context encoding, position-dependent orthogonal subspaces, frame-level representation, phonological vectors, context-dependent representations, Transformer-based models, implicit phonetic boundaries

130. ❌ 98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Journal/Source: arXiv
Published: 2026-03-13
arXiv link: http://arxiv.org/abs/2603.12646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

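Scoring rationale: The paper's core topic is optimizing an LLM routing system, which falls under applied LLM technology. Highly relevant keywords: 1) 'Large Language Models' (the paper centers on an LLM routing system); 2) 'KV Cache Compression OR Linear Attention OR FlashAttention' (Stage 1 uses a Flash Attention optimization); 3) 'Context Window Extension OR Long Context LLMs' (it handles 8K-32K long contexts); 4) 'Speculative Decoding OR Inference Acceleration' (it achieves a 98x speedup). Other keywords such as MoE, SFT, and RAG are not involved and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high latency and large memory footprint of LLM routing systems on long contexts (8K-32K tokens) with a three-stage approach of Flash Attention optimization, prompt compression, and near-streaming processing, achieving a 98x speedup and reducing GPU memory usage to under 800 MB.

Abstract translation (Chinese)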

用于安全分类、领域路由和PII检测的LLM请求拦截系统级路由器必须兼具高速与操作轻量化特性：其应为每个请求增加极小的延迟，且无需专用GPU，这一昂贵资源更应留给LLM推理本身。当路由器与vLLM服务实例共置于同一GPU时，标准注意力机制的$O(n^2)$内存开销使得长上下文分类（8K–32K词元）无法实现：在8K词元长度下，三个并发分类器仅注意力掩码就需占用约4.5GB内存，远超vLLM运行后剩余的内存空间。本文针对vLLM语义路由器提出三阶段优化方案（基于AMD Instinct MI300X平台测试），同步解决了延迟与内存问题。阶段一：为ROCm平台上的ONNX Runtime定制CK Flash Attention算子，将注意力内存从$O(n^2)$降至$O(n)$，端到端延迟从4,918ms降至127ms（38.7倍），实现了在SDPA会内存溢出的场景下处理8K–32K词元。阶段二：采用经典NLP提示词压缩技术（TextRank、位置加权、TF-IDF及新颖性评分），在不依赖神经推理的情况下将所有输入压缩至约512词元，使得延迟与GPU内存占用不受原始提示长度影响而保持恒定（端到端延迟127→62ms，2.0倍）。阶段三：通过自适应分块与零拷贝JSON的近流式主体处理，消除序列化开销（端到端延迟62→50ms，1.2倍）。累积优化效果：实现98倍提升（4,918ms→50ms），16K词元路由仅需108ms，路由器总GPU内存占用低于800MB，小到足以与LLM服务共享GPU，无需专用加速器。阶段一针对AMD ROCm平台（NVIDIA GPU已通过cuDNN支持FlashAttention）；阶段二与阶段三为硬件无关方案。

Abstract

System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU – an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K–32K tokens) impossible: at 8K tokens, three concurrent classifiers need ~4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4,918 ms to 127 ms (38.7×), enabling 8K–32K tokens where SDPA OOMs. Stage 2: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ~512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127 → 62 ms, 2.0×). Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62 → 50 ms, 1.2×). Cumulatively: 98× improvement (4,918 ms to 50 ms), 16K-token routing in 108 ms, and a total router GPU footprint under 800 MB – small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage 1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages 2 and 3 are hardware-agnostic.

Keywords: LLM routing, Flash Attention, prompt compression, vLLM, long-context classification, inference acceleration, GPU memory optimization, semantic router
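
As a quick consistency check on the figures quoted in the abstract, the per-stage and cumulative speedups can be recomputed from the four end-to-end latencies. This is plain arithmetic on the published numbers, not a reproduction of the benchmark; the stage labels are shorthand chosen here.

```python
from typing import List, Tuple

# End-to-end latencies in milliseconds, as quoted in the abstract.
LATENCIES_MS: List[Tuple[str, float]] = [
    ("baseline", 4918.0),
    ("stage 1 (Flash Attention)", 127.0),
    ("stage 2 (prompt compression)", 62.0),
    ("stage 3 (near-streaming)", 50.0),
]


def speedups(latencies: List[Tuple[str, float]]) -> Tuple[List[Tuple[str, float]], float]:
    """Return (per-stage speedup factors, cumulative speedup)."""
    per_stage = []
    for (_, before), (name, after) in zip(latencies, latencies[1:]):
        per_stage.append((name, before / after))
    cumulative = latencies[0][1] / latencies[-1][1]
    return per_stage, cumulative


per_stage, cumulative = speedups(LATENCIES_MS)
for name, factor in per_stage:
    print(f"{name}: {factor:.1f}x")
print(f"cumulative: {cumulative:.1f}x")
```

The per-stage factors come out at about 38.7, 2.0, and 1.2, and their product is the cumulative factor of about 98, matching the abstract.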

131. ❌ RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

Authors: He Zhu, Yanshu Li, Wen Liu, Haitian Yang
Journal/Source: arXiv
Published: 2026-03-13
arXiv link: http://arxiv.org/abs/2603.12582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

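Scoring rationale: The paper presents RTD-Guard, a framework for detecting textual adversarial attacks. It focuses on security defenses for NLP systems and uses a pre-trained RTD discriminator for black-box detection. All scoring keywords concern innovations in the principles of large-model or deep-learning techniques, or applications in scientific fields, whereas this paper belongs to traditional NLP security and does not touch on large-model architecture, training, inference optimization, alignment, agents, scientific applications, or any other scored topic. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes RTD-Guard, a black-box framework for detecting textual adversarial examples that needs no adversarial data, model tuning, or internal access. It uses a pre-trained RTD discriminator to localize suspicious tokens and observes the change in prediction confidence before and after intervention. On multiple benchmark datasets it effectively detects adversarial texts generated by a variety of state-of-the-art attack methods and outperforms existing detection baselines.

Abstract translation (Chinese)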

文本对抗攻击通过引入难以察觉的扰动误导深度学习模型，对自然语言处理（NLP）系统构成严重安全威胁。尽管对抗样本检测为鲁棒训练提供了一种轻量级替代方案，但现有方法通常依赖于攻击的先验知识、对受害模型的白盒访问或大量查询，这严重限制了其实际部署。本文提出RTD-Guard，一种新颖的黑盒文本对抗样本检测框架。我们的核心洞见是：对抗攻击中的词语替换扰动与预训练的替换词检测（Replaced Token Detection, RTD）判别器所识别的“被替换词符”高度相似。基于此，RTD-Guard利用一个现成的RTD判别器——无需微调——来定位可疑词符，将其掩码，并通过观察干预前后受害模型预测置信度的变化来检测对抗样本。整个过程无需对抗数据、模型调优或模型内部访问，仅需两次黑盒查询。在多个基准数据集上的综合实验表明，RTD-Guard能有效检测由多种先进攻击方法生成的对抗文本。它在多项指标上超越了现有检测基线，提供了一种高效、实用且资源消耗低的防御机制，尤其适用于资源受限或对隐私敏感环境中的实际部署。

Abstract

Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.

Keywords: Textual adversarial detection, Black-box framework, Replaced Token Detection (RTD), Adversarial example detection, NLP security, Prediction confidence shift, Resource-light defense, Word-substitution perturbations
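
The detection loop described in the abstract (localize suspicious tokens, mask them, compare the victim's confidence before and after) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `replaced_prob` and `victim_probs` are hypothetical callables standing in for the RTD discriminator and the black-box victim model, and `top_k`, `shift_threshold`, and the rule "flag when confidence in the original label drops by at least the threshold" are assumptions made here.

```python
from typing import Callable, List, Sequence


def rtd_guard_detect(
    tokens: Sequence[str],
    replaced_prob: Callable[[Sequence[str]], List[float]],  # hypothetical RTD discriminator
    victim_probs: Callable[[Sequence[str]], List[float]],   # hypothetical black-box victim
    top_k: int = 1,
    shift_threshold: float = 0.3,
    mask_token: str = "[MASK]",
) -> bool:
    """Return True if the input looks adversarial under the assumed decision rule."""
    # Step 1: score every token and keep the top_k most suspicious positions.
    scores = replaced_prob(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    suspicious = set(ranked[:top_k])
    # Step 2: mask the suspicious positions.
    masked = [mask_token if i in suspicious else tok for i, tok in enumerate(tokens)]
    # Step 3: exactly two black-box queries to the victim model.
    before = victim_probs(tokens)
    after = victim_probs(masked)
    label = max(range(len(before)), key=lambda i: before[i])
    # Step 4: a large confidence drop on the original label suggests an attack.
    return (before[label] - after[label]) >= shift_threshold


# Toy stand-ins for demonstration only.
def toy_rtd(tokens: Sequence[str]) -> List[float]:
    return [0.9 if t == "dreadful" else 0.05 for t in tokens]


def toy_victim(tokens: Sequence[str]) -> List[float]:
    return [0.2, 0.8] if "dreadful" in tokens else [0.55, 0.45]


flag_adv = rtd_guard_detect(["the", "plot", "was", "dreadful"], toy_rtd, toy_victim)
flag_clean = rtd_guard_detect(["the", "plot", "was", "fine"], toy_rtd, toy_victim)
print(flag_adv, flag_clean)
```

Because the victim is called only through `victim_probs`, the sketch needs no gradients or internal states, which is the black-box property the abstract emphasizes.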

132. ❌ LMEB: Long-horizon Memory Embedding Benchmark

Authors: Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang
Journal/Source: arXiv
Published: 2026-03-13
arXiv link: http://arxiv.org/abs/2603.12572v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

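Scoring rationale: The paper mainly studies an evaluation benchmark for text embedding models, focusing on long-horizon memory retrieval tasks. Relevance to the keywords: 1) highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (8 points), because the paper's core is evaluating embedding models on memory retrieval, a key component of RAG systems; 2) somewhat relevant to "Large Language Models OR LLMs OR Foundation Models" (3 points), because the evaluated embedding models may build on large-model techniques, and the memory-augmented systems it mentions, such as OpenClaw, may involve large models; 3) the other keywords (MoE, Scaling Laws, fine-tuning methods, reasoning techniques, AI for Science, and so on) have no direct relation to the paper, which does not cover these techniques or application areas.

!!! tip deepseek-chat TL;DR

The paper proposes the Long-horizon Memory Embedding Benchmark (LMEB) to evaluate text embedding models on complex, long-horizon memory retrieval tasks. It finds that performance on traditional passage retrieval does not generalize to long-horizon memory retrieval, and that no universal model yet handles all memory retrieval tasks.

Abstract translation (Chinese)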

记忆嵌入对于记忆增强系统（如OpenClaw）至关重要，但其评估在当前文本嵌入基准测试中尚未得到充分探索。现有基准测试仅狭隘地聚焦于传统段落检索，未能评估模型处理涉及碎片化、上下文依赖且时间跨度较长的长程记忆检索任务的能力。为解决这一问题，我们提出了长程记忆嵌入基准测试（Long-horizon Memory Embedding Benchmark，简称LMEB），这是一个综合性框架，用于评估嵌入模型处理复杂长程记忆检索任务的能力。LMEB涵盖22个数据集和193个零样本检索任务，覆盖4种记忆类型：情景记忆、对话记忆、语义记忆和程序记忆，数据来源包括AI生成和人工标注。这些记忆类型在抽象程度和时间依赖性上各不相同，捕捉了记忆检索的不同方面，反映了现实世界中多样化的挑战。我们评估了15个广泛使用的嵌入模型，参数量从数亿到百亿不等。结果表明：（1）LMEB提供了合理的难度水平；（2）更大的模型并不总是表现更好；（3）LMEB与MTEB展现出正交性。这表明该领域尚未收敛到一个能够在所有记忆检索任务中表现出色的通用模型，且传统段落检索中的性能可能无法推广到长程记忆检索中。总之，通过提供一个标准化且可复现的评估框架，LMEB填补了记忆嵌入评估中的一个关键空白，推动了处理长期、上下文依赖的记忆检索的文本嵌入技术的进一步发展。LMEB可在https://github.com/KaLM-Embedding/LMEB获取。

Abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models’ ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models’ capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.

Keywords: memory embeddings, long-horizon memory retrieval, embedding benchmark, text embedding evaluation, retrieval tasks, memory-augmented systems, zero-shot retrieval, MTEB orthogonality

133. ❌ Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Authors: Mengjie Zhao, Lianbo Liu, Yusuke Fujita, Hao Shi, Yuan Gao, Roman Koshkin, Yui Sudo
Journal/Source: arXiv
Published: 2026-03-13
arXiv link: http://arxiv.org/abs/2603.12565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

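Scoring rationale: The paper studies alignment for SpeechLLMs (speech large language models). Its core is using Direct Preference Optimization (DPO) so that Japanese SpeechLLMs produce conversational text better suited to speech synthesis. It is therefore highly relevant to 'Large Language Models' (SpeechLLMs are a variant of LLMs), to 'Instruction Tuning OR Alignment OR Value Alignment' (it studies an alignment problem), and to 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (it uses DPO). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and Agents are not involved and score 0.

!!! tip deepseek-chat TL;DR

To address the problem that text output by Japanese SpeechLLMs is poorly suited to speech synthesis, the paper proposes an alignment method based on Direct Preference Optimization (DPO) that substantially improves suitability for speech synthesis while largely preserving performance on the original written-style evaluation.

Abstract translation (Chinese)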

语音大语言模型通常将经过自动语音识别训练的编码器与基于文本的大语言模型主干相结合，这使其继承了书面风格的输出模式，而不适用于文本转语音合成。这种不匹配在日语中尤为明显，因为日语的口语和书面语体在礼貌标记、句末语气词及句法复杂度上存在显著差异。我们提出一种基于偏好的对齐方法，使日语语音大语言模型适应于生成“适于语音输出的文本”——即简洁、口语化且易于合成为自然语音的文本。为严格评估此任务，我们引入了SpokenElyza基准，该基准基于ELYZA-tasks-100构建，并由母语专家通过听觉验证进行标注，专门用于评估日语语音输出适宜性。实验表明，我们的方法在SpokenElyza基准上实现了显著提升，同时基本保持了原有书面风格评估的性能。我们将公开SpokenElyza基准，以支持未来日语口语对话系统的研究。

Abstract

SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.

Keywords: SpeechLLMs, Japanese, alignment, Direct Preference Optimization, speech-worthiness, text-to-speech, SpokenElyza benchmark, spoken dialog systems
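
The title names Direct Preference Optimization, while the abstract only says "preference-based alignment" and gives no training details. For orientation, here is a minimal sketch of the standard DPO loss for a single preference pair, assuming the usual formulation with a frozen reference model. The log-probability values and `beta` are made up for illustration. In this setting the "chosen" response would presumably be the speech-worthy one and the "rejected" response the written-style one, which is an assumption rather than something the abstract states.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more the policy likes each response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative numbers only: the policy already prefers the chosen response
# slightly more than the reference model does.
loss = dpo_loss(
    policy_chosen_logp=-12.0,
    policy_rejected_logp=-15.0,
    ref_chosen_logp=-14.0,
    ref_rejected_logp=-13.0,
)
print(f"DPO loss: {loss:.4f}")
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the chosen response relative to the reference.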

134. ❌ When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Authors: Eddie Landesberg
Journal/Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12520v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

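Scoring rationale: The paper's core topic is the evaluation problem that arises when large language models are used as judges for best-of-n selection. It is highly relevant only to the keyword 'Large Language Models OR LLMs OR Foundation Models' (10 points), because the paper explicitly takes LLMs as its object of study and examines their performance as evaluators. The other keywords concern model architecture, training methods, reasoning techniques, application areas, and so on, none of which the paper covers, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper finds that when large language models are used as judges for best-candidate selection, relying only on a global agreement metric (such as correlation with reference labels) seriously overestimates their real performance, because selection quality depends mainly on the within-prompt ranking signal rather than on global agreement. Pairwise judging recovers much of the lost selection signal.

Abstract translation (Chinese)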

大型语言模型常被用作评估器，为候选回答进行评分，随后通过单一全局指标（例如与参考标签的相关性）进行验证。当实际部署任务是在同一提示内进行n选一择优时，这种方法可能产生误导。

基于Chatbot Arena构建的包含5,000个提示的四选一基准测试表明，一个具有中等全局相关性（r = 0.47）的评估器，仅能捕捉到完美选择相较于随机选择所能实现改进的21.0%。这一差距的产生是因为全局一致性主要由提示层面的基线效应驱动，而选择行为取决于提示内的排序能力：其提示内相关性仅为r_within = 0.27，且粗略的逐点评分在67%的成对比较中产生了同分结果。

在一项匹配成对的二选一审计中，采用显式的成对比较评估方法能够挽回大部分丢失的信号，将改进捕捉率从21.1%提升至61.2%。对于基于评估器的选择任务，相关的审计报告应包含提示内信号、同分率以及改进捕捉率/前1准确率，而非仅报告全局一致性。

Abstract

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.

Keywords: Large Language Models, LLM judges, best-of-n selection, evaluation metrics, within-prompt ranking, pairwise judging, Chatbot Arena, recovery rate
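
The gap between global agreement and selection quality can be reproduced on a toy example. The sketch below follows the abstract's description of recovery (the share of the oracle-over-random improvement that judge-based selection captures) and computes global and within-prompt Pearson correlations. The data, the uniform tie-breaking, and the convention of returning 0 when a variable has zero variance are assumptions made here, not details from the paper.

```python
import math
from statistics import mean
from typing import List, Tuple

Candidate = Tuple[float, float]  # (judge_score, true_quality)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either variable is constant (a convention)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlations(prompts: List[List[Candidate]]) -> Tuple[float, float]:
    """Return (global r, within-prompt r); the latter removes per-prompt means first."""
    s_all, q_all, s_cen, q_cen = [], [], [], []
    for cands in prompts:
        scores = [s for s, _ in cands]
        quals = [q for _, q in cands]
        ms, mq = mean(scores), mean(quals)
        s_all += scores
        q_all += quals
        s_cen += [s - ms for s in scores]
        q_cen += [q - mq for q in quals]
    return pearson(s_all, q_all), pearson(s_cen, q_cen)


def recovery(prompts: List[List[Candidate]]) -> float:
    """Share of the oracle-over-random gain captured by picking the judge's top score."""
    picked, oracle, rand = [], [], []
    for cands in prompts:
        quals = [q for _, q in cands]
        top = max(s for s, _ in cands)
        tied = [q for s, q in cands if s == top]
        picked.append(mean(tied))  # expected quality under uniform tie-breaking
        oracle.append(max(quals))
        rand.append(mean(quals))
    gain = mean(oracle) - mean(rand)
    return (mean(picked) - mean(rand)) / gain if gain > 0 else 0.0


# Toy data: the judge separates an easy prompt from a hard one but ties within each.
toy = [[(2.0, 0.1), (2.0, 0.2)], [(8.0, 0.8), (8.0, 0.9)]]
r_global, r_within = correlations(toy)
rec = recovery(toy)
print(f"global r={r_global:.2f}, within-prompt r={r_within:.2f}, recovery={rec:.2f}")
```

In this toy case the global correlation is close to 1 because the judge tracks prompt difficulty, yet recovery is 0 because every within-prompt comparison is a tie, which mirrors the failure mode the abstract describes.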

135. ❌ Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

Authors: Siddharth Srikanth, Freddie Liang, Sophie Hsu, Varun Bhatt, Shihan Zhao, Henry Chen, Bryon Tjanaka, Minjune Hwang, Akanksha Saran, Daniel Seita, Aaquib Tabrez, Stefanos Nikolaidis
Journal/Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

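Scoring rationale: The paper studies the robustness of Vision-Language-Action (VLA) models: the Q-DIG method generates diverse adversarial instructions that expose VLA vulnerabilities, and those instructions are then used for fine-tuning to raise task success rates. Relevance to the keywords: 1) "Large Language Models OR LLMs OR Foundation Models": the paper involves Vision-Language Models (VLMs), which count as large models but are not its core, so 5 points. 2) "Post-training OR Supervised Fine-tuning OR SFT": the paper fine-tunes VLAs on generated instructions, which is post-training fine-tuning, so 8 points. 3) "Instruction Tuning OR Alignment OR Value Alignment": the paper focuses on instruction diversity and robustness, which relates to instruction tuning, so 8 points. 4) "LLM Agents OR Autonomous Agents OR Agentic Workflow": VLA models drive robotic systems, an agent application, so 8 points. Other keywords such as MoE, Scaling Laws, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Q-DIG, which red-teams Vision-Language-Action models by generating diverse natural language instructions that expose their vulnerabilities, then fine-tunes the models on those instructions, improving the success rate and robustness of robot tasks.

Abstract translation (Chinese)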

视觉-语言-动作（Vision-Language-Action，VLA）模型在实现通用机器人系统执行一系列视觉-语言任务方面具有巨大潜力。然而，基于VLA的机器人性能对语言指令的精确措辞高度敏感，且难以预测此类机器人何时会失败。为提升VLA对不同措辞的鲁棒性，我们提出Q-DIG（多样化指令生成的质量多样性方法），该方法通过可扩展地识别多样化的自然语言任务描述来进行红队测试，这些描述能在保持任务相关性的同时诱发故障。Q-DIG将质量多样性（Quality Diversity，QD）技术与视觉-语言模型（Vision-Language Models，VLMs）相结合，生成广泛的对抗性指令，以揭示VLA行为中有意义的脆弱性。我们在多个仿真基准测试中的结果表明，与基线方法相比，Q-DIG能发现更多样化且更有意义的故障模式，并且基于生成的指令对VLA进行微调可提高任务成功率。此外，用户研究的结果表明，Q-DIG生成的提示词被评价为比基线方法更自然、更接近人类表达。最后，对Q-DIG提示词的真实世界评估结果与仿真一致，且基于生成提示词对VLA进行微调进一步提升了其在未见指令上的成功率。综上所述，这些发现表明Q-DIG是一种有前景的方法，可用于识别脆弱性并提升基于VLA的机器人的鲁棒性。我们的匿名项目网站位于qdigvla.github.io。

Abstract

Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.

Keywords: Vision-Language-Action Models, Quality Diversity, Red-Teaming, Adversarial Instructions, Robot Robustness, Fine-tuning, Vision-Language Models, Task Success Rates

136. ❌ Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback

Authors: Mei Tan, Lena Phalen, Dorottya Demszky Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12471v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
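
The table above can be read as a weighted sum. The sketch below is a minimal illustration, assuming each row's score is weight × relevance and the paper total is the sum of the row scores; the report does not state this aggregation rule, so treat it as an assumption. The pass-mark formula (5.0 + 0.8 × total weight) is the one given in the report's configuration section. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows.

    Assumption: this is how the per-keyword 'Score' column and the paper
    total are derived; the report does not spell the rule out.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Rows copied from the table above: one keyword with relevance 10.0 and
# 26 keywords with relevance 0.0, all listed with weight 0.0.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26

threshold = pass_threshold(27.0)  # 27 keywords, each configured with weight 1.0
total = paper_score(rows)
passed = total >= threshold
```

Under this reading, a 0.0 total follows from the 0.0 weights listed in the table rather than from low relevance values.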

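Scoring rationale: The paper studies LLMs applied to personalized writing feedback, with a core focus on LLM bias, so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points). It does not cover the technical principles, methods, or applications behind the other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, or model compression, so those keywords receive 0 points. The paper is an application study of LLMs in education, but it does not address specific scientific domains under “AI for Science” such as bioinformatics or cheminformatics, so that keyword also receives 0 points.

!!! tip deepseek-chat TL;DR

The paper examines how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) produce systematic, stereotype-aligned feedback biases in personalized writing feedback depending on student attributes (such as gender, race, and learning needs), revealing the “Marked Pedagogies” problem in automated feedback tools.

Abstract translation (Chinese)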

有效的个性化反馈对学生的读写能力发展至关重要。尽管基于大语言模型（LLM）的工具如今有望大规模自动化提供此类反馈，但大语言模型并非语言中立：它们偏向标准学术英语并复制社会刻板印象，这引发了人们对“个性化”如何塑造学生所获反馈的担忧。本研究考察了四种广泛使用的大语言模型（GPT-4o、GPT-3.5-turbo、Llama-3.3 70B、Llama-3.1 8B）如何根据学生属性调整书面反馈。我们使用PERSUADE数据集中的600篇八年级议论文，在提示条件中嵌入性别、种族/民族、学习需求、学业成就和动机属性来生成反馈。通过采用“标记性词汇”（Marked Words）分析框架，我们分析了模型输出中的词汇变化。结果显示，即使作文内容完全相同，基于预设学生属性的反馈仍存在系统性、与刻板印象一致的偏移。针对被种族、语言或残疾标记的学生，其反馈常呈现“正向反馈偏差”和“反馈保留偏差”——即过度使用赞扬、实质性批评较少，并预设其能力有限。在不同属性条件下，模型不仅调整了所强调的内容，还改变了写作评判方式及对学生的称呼语气。我们将这些教学倾向称为“标记性教学法”（Marked Pedagogies），并强调自动化反馈工具需要透明度和问责机制。

Abstract

Effective personalized feedback is critical to students’ literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how “personalization” shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes–even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias–overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.

Keywords: Large Language Models, personalized feedback, linguistic biases, automated writing feedback, stereotype-aligned shifts, Marked Pedagogies, educational technology, bias in AI

137. ❌ Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

Authors: Xing Zi, Xinying Zhou, Jinghao Xiao, Catarina Moreira, Mukesh Prasad Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12458v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

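Scoring rationale: The paper's core topic is the multi-hop reasoning ability of LLMs in medicine. It directly involves LLMs, RAG, multi-step reasoning, in-depth reasoning, and AI for Science; LLMs, RAG, Chain of Thought, System 2 Thinking, and AI for Science are all core content (10 points each). By analyzing reasoning deficits, it indirectly touches on hallucination mitigation and explainable AI (5 points each). Other keywords, such as MoE, quantization, and alignment, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper reveals severe deficits in the real-world clinical multi-hop diagnostic reasoning of LLMs. It builds a topology-regularized medical knowledge graph and the ShatterMed-QA benchmark, and shows that RAG effectively restores model performance, diagnosing the fundamental reasoning deficits of current medical AI.

Abstract translation (Chinese)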

尽管大型语言模型（LLM）在标准医学基准测试中通过单跳事实回忆达到了专家级水平，但在真实临床环境中所需的复杂多跳诊断推理方面却表现严重不足。一个主要障碍是“捷径学习”，即模型利用知识图谱中高度连接的通用枢纽节点（例如“炎症”）来绕过真实的微观病理级联反应。为解决这一问题，我们推出了ShatterMed-QA——一个包含10,558道多跳临床问题的双语基准数据集，旨在严格评估深度诊断推理能力。我们的框架采用创新的$k$-Shattering算法构建拓扑正则化的医学知识图谱，该算法通过物理剪除通用枢纽节点来显式切断逻辑捷径。我们通过隐式桥接实体掩蔽和拓扑驱动的困难负样本采样合成评估案例，迫使模型在不依赖表面排除的情况下遍历生物学上合理的干扰项。对21个LLM的全面评估显示，所有模型在我们的多跳任务上均出现性能大幅下降，领域专用模型尤为明显。关键的是，通过检索增强生成技术恢复被掩蔽的证据后，几乎所有模型都实现了近乎完全的性能恢复，这验证了ShatterMed-QA的结构保真度，并证明其能有效诊断当前医疗AI的根本推理缺陷。欢迎访问我们的项目网站（https://shattermed-qa-web.vercel.app/）探索数据集、交互式示例及完整排行榜。

Abstract

While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is “shortcut learning”, where models exploit highly connected, generic hub nodes (e.g., “inflammation”) in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA’s structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/

Keywords: Large Language Models, multi-hop reasoning, medical diagnosis, shortcut learning, Retrieval-Augmented Generation, knowledge graph, benchmark evaluation, clinical AI

138. ❌ CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Authors: Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12453v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

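Scoring rationale: The paper's core is political evasion detection using a dual-LLM ensemble and the Deliberative Complexity Gating mechanism, which is highly relevant to LLMs, reasoning methods, and multi-agent systems. Specifically: 1) it explicitly uses LLMs (10 points); 2) it involves multi-step reasoning and deliberate thinking (8 points); 3) it uses multi-agent debate as an alternative strategy (8 points); 4) it achieves self-correction through self-consistency (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a political evasion detection system based on a dual-LLM ensemble and the Deliberative Complexity Gating mechanism, achieving a Macro-F1 score of 0.85 and third place in SemEval-2026 Task 6.

Abstract translation (Chinese)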

本文介绍了我们为SemEval-2026任务6设计的系统，该任务旨在将政治访谈中回答的清晰度分为三类：清晰回答（Clear Reply）、模糊回答（Ambivalent）和清晰不回答（Clear Non-Reply）。我们提出了一种基于自洽性（Self-Consistency, SC）和加权投票的异构双大语言模型（Large Language Model, LLM）集成方法，以及一种新颖的事后校正机制——审议复杂性门控（Deliberative Complexity Gating, DCG）。该机制利用跨模型行为信号，并基于大语言模型响应长度与样本模糊性高度相关的发现进行设计。为了进一步探索提升模糊性检测的机制，我们评估了多智能体辩论作为增强审议能力的替代策略。与DCG利用跨模型行为信号自适应门控推理不同，辩论通过增加智能体数量而非模型多样性来提升性能。我们的解决方案在评估集上取得了0.85的宏观F1分数，最终获得第三名。

Abstract

This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.

Keywords: political evasion detection, large language models, heterogeneous ensemble, deliberative complexity gating, multi-agent debate, self-consistency, ambiguity detection, weighted voting
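
The combination of self-consistency and weighted voting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' system: the function names, the sample lists, and the model weights (0.6 and 0.4) are all hypothetical, and the Deliberative Complexity Gating step is left out because the abstract does not specify it.

```python
from collections import Counter


def self_consistency_distribution(samples):
    """Turn repeated label samples from one model into label frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return {label: count / total for label, count in counts.items()}


def weighted_vote(dist_a, dist_b, weight_a=0.5, weight_b=0.5):
    """Combine two label distributions with per-model weights and pick the top label."""
    labels = set(dist_a) | set(dist_b)
    scores = {
        label: weight_a * dist_a.get(label, 0.0) + weight_b * dist_b.get(label, 0.0)
        for label in labels
    }
    # Iterate in sorted order so ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


# Hypothetical samples from two different LLMs for one interview answer.
samples_a = ["Clear Reply", "Ambivalent", "Ambivalent"]
samples_b = ["Clear Reply", "Clear Reply", "Clear Reply"]

prediction = weighted_vote(
    self_consistency_distribution(samples_a),
    self_consistency_distribution(samples_b),
    weight_a=0.6,
    weight_b=0.4,
)
```

Here the second model's unanimous samples outweigh the first model's split vote, so `prediction` is "Clear Reply".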

139. ❌ Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

Authors: Abdullah Al Mofael, Lisa M. Kuhn, Ghassan Alkadi, Kuo-Pao Yang Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

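Scoring rationale: The paper performs a causal analysis of GPT-2 Small to study the internal mechanisms by which it processes negation, so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points). The study analyzes internal representations and attention heads, which falls under interpretability and is highly relevant to “Mechanistic Interpretability OR Explainable AI” (10 points). The paper's attention to negation-handling errors bears some relation to factuality and hallucination mitigation, but does not fully match that keyword's direct technical methods (such as factuality-enhancement techniques), hence 5 points. Other keywords (such as MoE, SFT, and RAG) are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

Through causal analysis, the paper studies how GPT-2 Small internally processes negation. It finds that negation handling is highly concentrated in a small number of attention heads in layers 4 to 6, and that these heads carry affirmative signal rather than restoring baseline behavior.

Abstract translation (Chinese)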

否定处理对现代语言模型而言始终是一项持续挑战，常导致语义反转或事实错误。本研究对GPT-2 Small内部处理此类语言转换的因果机制进行分析，从层级和注意力头两个维度检测其隐藏表征。我们的分析基于自主构建的12,000对肯定句与否定句配对数据集，涵盖多种语言模板及否定形式。为量化该行为，我们定义了否定效应分数（Negation Effect Score, NES）这一指标，用于衡量模型区分肯定陈述与其否定形式的能力。我们通过两项关键干预实验探究因果结构：在激活修补实验中，将肯定句的内部激活值插入对应否定句，以观察语义如何迁移；在消融实验中，临时禁用特定注意力头以观测逻辑极性变化。这些步骤共同揭示了否定信号在GPT-2各层间的传递与演化机制。研究结果表明，该能力并非广泛分布，而是高度集中于少数中层注意力头（主要集中在第4至6层）。消融这些特定组件会直接破坏模型的否定敏感性：在领域内测试中，消融使NES升高（表明否定敏感性减弱），而重新引入缓存的肯定句激活（救援操作）使NES进一步升高，证实这些注意力头携带的是肯定信号而非恢复基线行为。在xNot360基准测试中，消融轻微降低NES，救援操作则使性能恢复至基线之上。该模式证明这些因果机制在不同否定形式中保持一致性，且能在外部xNot360基准中检测到，尽管效应幅度较小。

Abstract

Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model’s sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific attention heads were temporarily disabled to observe how logical polarity changed. Together, these steps revealed how negation signals move and evolve through GPT-2’s layers. Our findings indicate that this capability is not widespread; instead, it is highly concentrated within a limited number of mid-layer attention heads, primarily within layers 4 to 6. Ablating these specific components directly disrupts the model’s negation sensitivity: on our in-domain, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior. On xNot360, ablation slightly decreased NES and rescue restored performance above baseline. This pattern demonstrates that these causal patterns are consistent across various negation forms and remain detectable on the external xNot360 benchmark, though with smaller magnitude.

Keywords: GPT-2, negation, causal analysis, attention heads, layer-level analysis, activation patching, ablation, interpretability
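
Activation patching, one of the two interventions named in the abstract, can be illustrated on a toy network. The sketch below is not GPT-2 and not the authors' code: the two-unit network, its weights, and the two inputs standing in for an affirmative and a negated sentence are all hypothetical. It only shows the mechanics of caching activations from one run, inserting them into another, and measuring how the output shifts.

```python
def forward(x, w1, w2, patch=None):
    """Tiny network: hidden = relu(w1 @ x), output = w2 . hidden.

    patch: optional {hidden_index: value} overriding hidden units,
    which is the activation-patching intervention.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden


# Hypothetical weights and inputs, chosen so the arithmetic is easy to follow.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [1.0, -1.0]
affirmative = [1.0, 0.0]  # stand-in for an affirmative sentence
negated = [1.0, 1.0]      # stand-in for its negated counterpart

# Cache activations from the affirmative run, then patch them one unit at a time
# into the negated run and record how far the output moves.
_, cached_hidden = forward(affirmative, w1, w2)
negated_output, _ = forward(negated, w1, w2)

effects = {}
for index, value in enumerate(cached_hidden):
    patched_output, _ = forward(negated, w1, w2, patch={index: value})
    effects[index] = patched_output - negated_output
```

In this toy setup only hidden unit 1 differs between the two runs, so patching it accounts for the whole output shift while patching unit 0 changes nothing. That is the kind of localization the paper reports at the level of attention heads.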

140. ❌ Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

Authors: Pengcheng Wen, Yanxu Zhu, Jiapeng Sun, Han Zhu, Yujin Zhou, Chi-Min Chan, Sirui Han, Yike Guo Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

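Scoring rationale: The paper's core topic is the causal influence of Chain-of-Thought reasoning on LLM generalization behavior, so it is highly relevant to “Chain of Thought OR CoT Reasoning OR Multi-step Reasoning” (15 points). It also involves LLMs (10 points), supervised fine-tuning (10 points), alignment (10 points), System 2 thinking (10 points), and explainable AI (10 points). Other keywords are not covered.

!!! tip deepseek-chat TL;DR

Through controlled experiments, the paper shows that Chain-of-Thought reasoning content, not just the final answer, causally shapes LLM generalization behavior, challenging alignment strategies that supervise only outputs.

Abstract translation (Chinese)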

思维链（Chain-of-Thought, CoT）常被视为窥探大语言模型决策过程的窗口，然而近期研究表明，它可能仅起到事后合理化作用。这引发了一个关键的对齐问题：推理轨迹是否独立于最终答案，对模型泛化能力具有因果性影响？为分离推理的因果效应，我们设计了一项对照实验，在保持最终有害答案不变的同时，改变推理路径。我们构建了包含三种推理类型的数据集：体现恶意的邪恶推理（Evil reasoning）、合理化伤害的误导推理（Misleading reasoning）以及屈从于压力的顺从推理（Submissive reasoning）。我们在多种范式下训练模型（参数规模0.6B–14B），包括“问题-思考-答案”（QTA）、“问题-思考”（QT）和“仅思考”（T-only），并在“有思考”与“无思考”两种模式下进行评估。研究发现：（1）与标准微调相比，思维链训练可能更大程度地放大有害泛化；（2）尽管最终答案相同，不同的推理类型会引发与其语义一致的不同行为模式；（3）仅在有推理而无答案监督的情况下进行训练（QT或T-only）即足以改变模型行为，证明推理承载着独立信号；（4）这些效应即使在无推理生成答案时依然存在，表明模型已深度内化。我们的结果表明，推理内容具有因果效力，这对仅监督输出结果的模型对齐策略提出了挑战。

Abstract

Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning’s causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with Evil reasoning embracing malice, Misleading reasoning rationalizing harm, and Submissive reasoning yielding to pressure. We train models (0.6B–14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.

Keywords: Chain-of-Thought, reasoning traces, causal effect, generalization behaviors, alignment, fine-tuning, LLM decision-making, harmful generalization

141. ❌ Efficient Reasoning with Balanced Thinking

Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12372v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

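Scoring rationale: The paper focuses on the reasoning efficiency of Large Reasoning Models (LRMs), specifically how to balance overthinking and underthinking. Highly relevant keywords: “Large Language Models” (the paper studies LRMs), “Chain of Thought” (multi-step reasoning), and “System 2 Thinking” (in-depth reasoning processes). Moderately relevant keywords: “Self-Correction” (reasoning trajectories are adjusted through dynamic control, similar to self-improvement), “Speculative Decoding” (related to inference efficiency, but not core), and “Mechanistic Interpretability” (reasoning dynamics are analyzed through hidden states, which offers some interpretability). Other keywords have no direct connection to the paper; for example, MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

To address overthinking and underthinking in Large Reasoning Models, the paper proposes ReBalance, a training-free framework that balances the reasoning process through confidence-based dynamic control, reducing redundant output while improving accuracy across multiple tasks.

Abstract translation (Chinese)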

大型推理模型（LRMs）已展现出卓越的推理能力，但它们常面临过度思考（overthinking）和思考不足（underthinking）的问题：前者在简单问题上消耗冗余的计算步骤，后者则未能充分利用模型内在能力探索足够的推理路径。这些问题导致效率低下和潜在的错误，限制了在资源受限环境中的实际部署。现有缓解过度思考的方法（如抑制反思关键词或调整推理长度）可能无意中引发思考不足，从而损害准确性。为此，我们提出ReBalance，一种无需训练即可实现均衡思考的高效推理框架。ReBalance以置信度作为推理动态的连续指标，通过高置信度方差识别过度思考，并通过持续过度自信检测思考不足。通过将小规模数据集的隐藏状态聚合为推理模式原型，我们计算出一个引导向量来指导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向，在过度思考时剪除冗余，在思考不足时促进探索。我们在从0.5B到32B的四个模型上，针对数学推理、通用问答和代码生成等九个基准任务进行了广泛实验，结果表明ReBalance在提升准确性的同时有效减少了输出冗余，为高效、稳健的LRM部署提供了一种通用、免训练、即插即用的策略。代码发布于https://github.com/yu-lin-li/ReBalance。

Abstract

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at https://github.com/yu-lin-li/ReBalance .

Keywords: Large Reasoning Models, overthinking, underthinking, confidence-based control, reasoning efficiency, training-free framework, dynamic steering, balanced thinking

142. ❌ Multi-Step Semantic Reasoning in Generative Retrieval

Authors: Steven Dong, Yubao Tang, Maarten de Rijke Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

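Scoring rationale: The paper focuses on strengthening multi-step semantic reasoning in numerical contexts for generative retrieval models. It is highly relevant to “Retrieval-Augmented Generation” (10 points), because generative retrieval is a form of RAG; highly relevant to “Chain of Thought” (15 points), because the paper explicitly proposes a multi-step reasoning framework; relevant to “System 2 Thinking” (10 points), because it involves in-depth reasoning; and relevant to “Large Language Models” (8 points), because generative retrieval models are typically built on LLMs. Other keywords, such as MoE, quantization, and alignment, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the limited reasoning ability of generative retrieval models on complex queries in numerical contexts such as finance, the paper proposes the ReasonGR framework, which strengthens multi-step semantic reasoning through structured prompting and a reasoning-focused adaptation module, improving retrieval accuracy and consistency on the FinQA dataset.

Abstract translation (Chinese)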

生成式检索模型将语料库编码于模型参数内，并直接为给定查询生成相关文档标识符。尽管该范式在检索任务中展现出潜力，但由于推理能力有限，现有生成式检索模型在处理数值语境下的复杂查询时仍面临挑战，例如涉及财务报表语义推理的查询。这一局限导致检索准确度欠佳，并阻碍了实际应用。我们提出ReasonGR框架，旨在增强生成式检索中数值语境下的多步语义推理能力。ReasonGR采用结构化提示策略，将任务特定指令与分步推理指导相结合，以更好地处理复杂检索查询。此外，该框架集成了专注于推理的适配模块，以优化推理相关参数的学习。在包含针对复杂文档的金融查询的FinQA数据集上的实验表明，ReasonGR显著提升了检索准确度与一致性，这标志着其在推理密集型检索场景中推动生成式检索模型发展的潜力。

Abstract

Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.

Keywords: Generative Retrieval, Multi-step Reasoning, Semantic Reasoning, Numerical Contexts, Financial Queries, Retrieval Accuracy, Structured Prompting, Reasoning Adaptation

143. ❌ TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Authors: Liang-Hsuan Tseng, Hung-yi Lee Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12350v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

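Rationale: The paper studies streaming in text-speech joint modeling, an application of large models to speech. It is related to 'Large Language Models' (5 points) because it involves text-speech joint modeling; to 'Pre-training' (5 points) because it involves pre-training of speech tokenization and embedding; and to 'Speculative Decoding OR Inference Acceleration' (5 points) because the paper focuses on latency in real-time streaming. Other keywords such as MoE, SFT, RAG, and quantization are not covered and receive 0 points.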
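As a rough check of the arithmetic behind this entry, the sketch below recomputes the pass mark from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0) and sums the table rows. The per-keyword rule `weight × relevance` is an assumption: it is consistent with the rows above, where every weight is listed as 0.0, but the report does not state it. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return base + per_weight * total_weight


def paper_score(rows):
    """Sum weight x relevance over (weight, relevance) rows.

    ASSUMPTION: this product rule is inferred from the tables, not stated
    anywhere in the report.
    """
    return sum(weight * relevance for weight, relevance in rows)


# The three non-zero relevance rows of the table above, as (weight, relevance).
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]

threshold = pass_threshold(27.0)
score = paper_score(rows)
print(round(threshold, 1), score, score >= threshold)
```

Under that assumption the three non-zero relevance values contribute nothing, because each is multiplied by a weight of 0.0, which matches the reported 0.0 / 26.6.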

!!! tip deepseek-chat TL;DR

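The paper addresses the inability of text-speech joint modeling to stream, which stems from its reliance on an external ASR system and a non-causal decoder. It proposes TASTE-S, which integrates a CTC-based ASR module and redesigns the unit decoder, matching the original TASTE's performance while significantly reducing latency.

Abstract translation

Text-speech joint spoken language modeling aims at natural and intelligent speech-based interaction, but building such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work narrows this gap with text-aligned tokenization and embedding, producing speech tokens whose length aligns with the corresponding text. However, that approach depends on an external automatic speech recognition system and uses a non-causal decoder, which limits streaming use. To overcome this limitation, we propose TASTE-S, a streamable extension of TASTE suited to real-time use. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding and redesigns the unit decoder to support on-the-fly decoding. With joint training, TASTE-S matches TASTE's performance while significantly reducing latency. Further investigation shows that TASTE-S remains robust to transcriptions and supports long-form encoding and decoding.

Abstract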

Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in lengths with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limits streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE’s performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.

Keywords: spoken language modeling, text-speech joint modeling, streamable tokenization, CTC-based ASR, real-time encoding, latency reduction, modality mismatch, dual-modality encoding
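The abstract mentions a CTC-based ASR module but gives no implementation detail. For readers unfamiliar with CTC, the sketch below shows only the standard best-path collapse rule (merge consecutive repeats, then drop blanks) applied to per-frame label predictions. It is not TASTE-S code; the function name, the blank index of 0, and the toy frame labels are assumptions made for illustration.

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse per-frame CTC labels: merge consecutive repeats, drop blanks.

    ASSUMPTION: label 0 is the blank symbol; real models may use another index.
    """
    collapsed = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# Toy per-frame predictions: a blank between the two 3s keeps them distinct.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_collapse(frames))  # [3, 3, 5]
```

Because each frame is handled using only the previous frame's label, the rule can run frame by frame, which is one reason CTC is a common choice in streaming settings.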

144. ❌ LLM-Augmented Therapy Normalization and Aspect-Based Sentiment Analysis for Treatment-Resistant Depression on Reddit

Authors: Yuxin Zhu, Sahithi Lakamana, Masoud Rouhizadeh, Selen Bozkurt, Rachel Hershenberg, Abeed Sarker Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

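Rationale: The paper mainly applies LLM-based data augmentation to improve a sentiment classification model, an application of large models in the biomedical domain. It is related to 'Large Language Models' and 'Post-training/SFT' (used for data augmentation and fine-tuning) and highly related to 'AI for Science/Bioinformatics' (applied to mental health research). Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper.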

!!! tip deepseek-chat TL;DR

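The study uses LLM-based data augmentation to build an aspect-based sentiment analysis model of how Reddit users with treatment-resistant depression describe their medication experiences. Conventional antidepressants drew more negative than positive sentiment, while ketamine and esketamine were viewed comparatively more favorably.

Abstract translation

Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients fail to achieve remission despite multiple adequate treatment trials. Evidence on pharmacologic options for TRD remains limited, and clinical trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary view of how patients use and evaluate medications in the real world. This study collected 5,059 Reddit posts explicitly mentioning TRD, published by 3,480 users across 28 mental health-related subreddits between 2010 and 2025. Of these, 3,839 posts mentioned at least one medication; after lexicon-based normalization of brand names, misspellings, and colloquialisms, this yielded 23,399 medication mentions covering 81 generic-name medications. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with LLM-based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying the classifier to the Reddit data, we quantified sentiment toward individual medications across three categories (positive, neutral, and negative) and tracked patterns by drug, user, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These results show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related online discussion, complementing clinical evidence with large-scale patient-generated perspectives.

Abstract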

Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories: positive, neutral, and negative, and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.

Keywords: treatment-resistant depression, aspect-based sentiment analysis, large language model data augmentation, Reddit social media analysis, medication sentiment classification, patient-reported outcomes, mental health informatics, DeBERTa fine-tuning
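The abstract reports a micro-F1 score of 0.800. As a reminder of what that metric measures, the sketch below pools true positives, false positives, and false negatives over all classes before computing precision and recall. Which classes the SMM4H shared task includes in its micro-average is not stated in the abstract, so the label set here is an assumption, and the toy labels are invented for illustration rather than taken from the paper.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over the given labels, then score."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and predictions (illustrative only, not data from the paper).
gold = ["negative", "neutral", "positive", "neutral", "neutral"]
pred = ["negative", "neutral", "neutral", "neutral", "positive"]
print(micro_f1(gold, pred, labels=["positive", "neutral", "negative"]))
```

When every class is included and each item has exactly one label, micro-F1 equals plain accuracy (3 of 5 correct in the toy data); restricting `labels` to a subset changes the value.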

145. ❌ XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12056v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

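Rationale: The paper proposes the XSkill framework for continual learning in multimodal agents, centered on agent architecture, tool use, and reasoning. Agent-related keywords (LLM Agents, Tool Use) are highly relevant (10 points), since the paper directly studies tool use and orchestration in multimodal agents. Reasoning-related keywords (Chain of Thought, System 2 Thinking, Self-Correction) are strongly related (8 points), since the paper emphasizes learning from experience to improve decision making and planning. Learning-mechanism keywords (Pre-training, Retrieval-Augmented Generation, In-context Learning) are somewhat related (5 points), since the framework involves knowledge extraction, retrieval, and adaptation, though these are not its core focus. Other keywords (MoE, SLMs, RLHF, etc.) are unrelated or not mentioned and receive 0 points.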

!!! tip deepseek-chat TL;DR

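Targeting the inefficient tool use and inflexible orchestration of multimodal agents in open-ended settings, the paper proposes XSkill, a dual-stream framework that extracts and retrieves experiences and skills grounded in visual observations to enable continual learning, and that substantially outperforms existing baselines on multiple benchmarks.

Abstract translation

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet in open-ended settings they still suffer from inefficient tool use and rigid orchestration. The central challenge is enabling such agents to improve continually by learning from past trajectories without parameter updates. We identify two complementary forms of reusable knowledge essential to this goal: experiences, which provide concise action-level guidance for tool selection and decision making, and skills, which provide structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, it distills and consolidates experiences and skills from multi-path rollouts through visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation, forming a continual learning loop. In experiments on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis shows that the two knowledge streams influence agents' reasoning behavior in complementary ways and exhibit superior zero-shot generalization.

Abstract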

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.

Keywords: multimodal agents, continual learning, tool use, experience, skills, visual grounding, reasoning, zero-shot generalization

146. ❌ Representation Learning for Spatiotemporal Physical Systems

Authors: Helen Qu, Rudy Morel, Michael McCabe, Alberto Bietti, François Lanusse, Shirley Ho, Yann LeCun Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

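Rationale: The paper studies representation learning for spatiotemporal physical systems, focusing on self-supervised learning for physical modeling and in particular comparing how well different methods serve downstream scientific tasks such as physical parameter estimation. Most keywords (LLM techniques, training methods, inference optimization, alignment, agent systems, and so on) are entirely unrelated, because they target large-model techniques in natural language processing. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (physical system modeling); since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance).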

!!! tip deepseek-chat TL;DR

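The paper studies self-supervised representation learning for spatiotemporal physical systems. It finds that not all methods designed for physical modeling outperform generic self-supervised methods, and that latent-space methods such as JEPA perform better on downstream scientific tasks such as physical parameter estimation.

Abstract translation

Machine learning approaches to spatiotemporal physical systems have focused mainly on next-frame prediction, with the goal of learning an accurate emulator of the system's evolution over time. However, these emulators are computationally expensive to train and suffer from performance pitfalls such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of next-frame prediction, such as estimating a system's governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable view of the physical relevance of a model's representations. We evaluate how effectively general-purpose self-supervised methods learn physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and that methods that learn in the latent space (for example, joint embedding predictive architectures, or JEPAs) outperform those that optimize pixel-level prediction objectives. Code is available at https://github.com/helenqu/physical-representation-learning.

Abstract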

Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system’s evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system’s governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at https://github.com/helenqu/physical-representation-learning.

Keywords: representation learning, spatiotemporal physical systems, self-supervised learning, downstream scientific tasks, physics-grounded representations, joint embedding predictive architectures, physical parameter estimation, machine learning for science

147. ❌ Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Authors: Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13215v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

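Rationale: The paper focuses on evaluating video world models, specifically whether a model can decouple state evolution from observation. Its core topic is 'World Models', so that keyword receives 10 points (highly relevant). All other keywords are unrelated, because the paper does not cover large language models, training techniques, reasoning methods, alignment, compression, hallucination mitigation, or AI-for-science applications. It studies the evaluation of world models in video generation rather than large-model techniques or applications.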

!!! tip deepseek-chat TL;DR

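The paper proposes the STEVO-Bench benchmark to evaluate whether video world models can decouple physical state evolution from observation. It exposes the limitations of existing models on this task and analyzes their data and architecture biases.

Abstract translation

Evolving processes in the real world, such as water pouring or ice melting, continue whether or not they are observed. Video world models generate "worlds" through 2D frame observations. Can these generated "worlds" evolve independently of observation? To probe this question, we design a benchmark that evaluates whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes through instructions for occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control across a diverse set of naturally occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol that automatically detects and disentangles the failure modes of video world models across key aspects of natural state evolution. Analysis of the STEVO-Bench results offers new insight into potential data and architecture biases in current video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/

Abstract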

Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate “worlds” via 2D frame observations. Can these generated “worlds” evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera “lookaway” trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provide new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/

Keywords: video world models, state evolution, observation decoupling, benchmark evaluation, STEVO-Bench, occlusion control, camera trajectories, model limitations

148. ❌ Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Authors: Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13185v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

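Rationale: The paper addresses spatio-temporal scene graph generation in computer vision. Its core contributions are the World Scene Graph Generation (WSGG) task, three methods (PWG, MWAE, 4DST), and the ActionGenome4D dataset. Most large-model and deep-learning keywords are unrelated, since they concern language models, training methods, and inference optimization, whereas this paper focuses on visual scene understanding. The only highly relevant keyword is 'World Models AND General World Models' (10 points): the paper explicitly introduces the 'World Scene Graph' concept, which aims to build a world-centric representation covering both observed and unobserved objects and closely matches the core idea of world models. 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (5 points) is weakly related: the paper uses Graph RAG-based approaches to evaluate vision-language models, but RAG itself is not the research focus. Other keywords such as AI for Science and LLMs are not covered.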

!!! tip deepseek-chat TL;DR

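The paper proposes the World Scene Graph Generation (WSGG) task, which builds spatio-temporal scene graphs covering both observed and unobserved objects from monocular video, and pursues world-centric, persistent, and interpretable scene reasoning through the ActionGenome4D dataset and three new methods (PWG, MWAE, 4DST).

Abstract translation

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate only in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes through feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations that include objects temporarily unobserved because of occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing at each timestamp a world scene graph that covers all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph) implements object permanence through a zero-order feature buffer; MWAE (Masked World Auto-Encoder) reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer) replaces the static buffer with differentiable per-object temporal attention enriched with 3D motion and camera-pose features. We further design and evaluate strong open-source vision-language models on the WSGG task through a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

Abstract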

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

Keywords: World Scene Graph Generation, Spatio-temporal scene graphs, 4D scenes, Object permanence, Masked completion, Vision-Language Models, Graph RAG, Temporal attention
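The abstract says PWG implements object permanence "via a zero-order feature buffer" without further detail. One plausible reading, sketched below, is a zero-order hold: each object's most recently observed feature is carried forward unchanged while the object is out of view. This interpretation, the class name, and the toy features are all assumptions and should not be read as the authors' implementation.

```python
class ZeroOrderBuffer:
    """Carry forward each object's last observed feature (zero-order hold).

    ASSUMPTION: this is an illustrative reading of "zero-order feature
    buffer", not the PWG implementation described in the paper.
    """

    def __init__(self):
        self._last_seen = {}

    def update(self, observed):
        """observed maps object id -> feature for objects visible this frame.

        Returns object id -> (feature, is_visible) for every object seen so far.
        """
        self._last_seen.update(observed)
        return {
            object_id: (feature, object_id in observed)
            for object_id, feature in self._last_seen.items()
        }


buffer = ZeroOrderBuffer()
frames = [
    {"cup": [1.0]},
    {"cup": [2.0], "phone": [5.0]},
    {"phone": [6.0]},  # the cup is occluded in this frame
]
for frame in frames:
    world_state = buffer.update(frame)
print(world_state)
```

In the last frame the cup keeps its feature from the second frame and is flagged as not visible, so a downstream relation predictor could still reason about it.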

149. ❌ Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

Authors: Hiba Adil Al-kharsan, Róbert Rajkó Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

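Rationale: The paper focuses on medical image classification (brain tumor MRI), using conventional deep-learning techniques (CNN, NNMF, diffusion models) rather than large language models or related techniques. Most keywords (LLMs, MoE, RLHF, etc.) are entirely unrelated. It has moderate relevance to 'Explainable AI' (it uses NNMF to extract interpretable features) and fairly strong relevance to 'AI for Science' (a bioinformatics/medical AI application).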

!!! tip deepseek-chat TL;DR

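The study proposes a framework combining non-negative matrix factorization, a lightweight convolutional neural network, and diffusion-based feature purification to make brain tumor MRI classification more robust to adversarial attacks. Experiments show that it maintains classification performance while significantly improving adversarial robustness.

Abstract translation

Brain tumor classification from magnetic resonance imaging (MRI) plays a key role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. First, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank the feature components and select the most discriminative ones. A lightweight CNN classifier is then trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced: a forward noising step followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under strong adversarial attacks generated by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial perturbations. The findings indicate that combining interpretable NNMF-based feature representations with a lightweight deep learning approach and a diffusion-based defense provides an effective and reliable solution for medical image classification under adversarial conditions.

Abstract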

Brain tumor classification from magnetic resonance imaging, which is also known as MRI, plays a sensitive role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study suggests a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. Initially, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen’s d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced. A forward noise method followed by a learned denoiser network is used before classification. System performance is estimated using both clean accuracy and robust accuracy under powerful adversarial attacks created by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial perturbations. The findings presuppose that combining interpretable NNMF-based representations with a lightweight deep approach and diffusion-based defense technique supplies an effective and reliable solution for medical image classification under adversarial conditions.

Keywords: brain tumor classification, MRI, non-negative matrix factorization, convolutional neural networks, diffusion-based feature purification, adversarial robustness, medical image analysis, AutoAttack
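Two steps named in the abstract can be illustrated compactly: ranking feature components by Cohen's d and applying forward noise before a denoiser. The sketch below uses the pooled-standard-deviation form of Cohen's d on a two-class toy problem and the standard DDPM-style forward step `sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise`. The paper's actual ranking procedure, noise schedule, and denoiser are not given in the abstract, so the binary labels, the `alpha_bar` value, the function names, and the toy features are all assumptions; the learned denoiser is omitted.

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (needs non-zero variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    mean_gap = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_gap / math.sqrt(pooled_var)


def rank_components(features, labels):
    """Return component indices sorted by |Cohen's d|, largest first."""
    n_components = len(features[0])
    effect = []
    for j in range(n_components):
        positives = [row[j] for row, y in zip(features, labels) if y == 1]
        negatives = [row[j] for row, y in zip(features, labels) if y == 0]
        effect.append(abs(cohens_d(positives, negatives)))
    return sorted(range(n_components), key=lambda j: effect[j], reverse=True)


def forward_noise(x0, alpha_bar, rng):
    """DDPM-style forward step: scale the signal and add Gaussian noise."""
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * value + noise * rng.gauss(0.0, 1.0) for value in x0]


# Toy features: component 0 separates the classes, component 1 does not.
features = [[0.1, 1.0], [0.2, 1.1], [0.9, 1.0], [1.0, 1.1]]
labels = [0, 0, 1, 1]
ranking = rank_components(features, labels)
noisy = forward_noise(features[0], alpha_bar=0.9, rng=random.Random(0))
print(ranking, noisy)
```

In a purification pipeline the noised features would next pass through a trained denoiser before classification; that network is outside the scope of this sketch.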

150. ❌ Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

Authors: Dingcheng Huang, Xiaotong Zhang, Kamal Youcef-Toumi Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

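Scoring rationale: The paper studies scheduling for multimodal streaming perception in human-robot collaboration (HRC). It proposes a lightweight perception scheduling framework that uses outputs from previous frames and scene context to estimate and schedule the necessary perception modules in real time, reducing computational latency and improving efficiency. Its core content covers real-time perception, compute scheduling, multimodal perception (visual, auditory, contextual), and system optimization; it does not involve large language models (LLMs), innovations in deep learning principles, or any specific technique among the scoring keywords (such as MoE, scaling laws, RLHF, or RAG). All keywords concern LLM technology, training methods, inference optimization, AI agents, or AI-for-science applications, whereas this paper focuses on conventional perception-module scheduling and system efficiency and has no direct connection to them. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper addresses latency in multimodal streaming perception for human-robot collaboration with a relevance-based lightweight perception scheduling framework. Experiments show that it reduces computational latency by up to 27.52%, improves MMPose activation recall by 72.73%, and reaches keyframe accuracy of up to 98%.

Abstract translation (Chinese)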

在现代人机协作应用中，多个感知模块通过联合提取视觉、听觉及上下文线索来实现全面的场景理解，使机器人能够智能地为人类主体提供适切协助。虽然在离线场景中逐帧执行多个感知模块可提升感知质量，但这种方式在流式感知场景中不可避免地会累积延迟，导致系统性能显著下降。近期在场景理解领域提出的“关联性”研究，已为开发高效的人机协作方法奠定了坚实基础。然而，现代感知流程仍面临信息冗余与计算资源分配欠优的挑战。受“关联性”概念及人机协作事件中信息稀疏性的启发，我们提出了一种新颖的轻量级感知调度框架，该框架能有效利用历史帧的输出结果，依据场景上下文实时估计并调度必要的感知模块。实验结果表明，与传统并行感知流程相比，所提出的感知调度框架将计算延迟降低了最高27.52%，同时将MMPose激活召回率提升了72.73%。此外，该框架展现出高达98%的关键帧准确率。这些结果验证了该框架能在不显著牺牲准确性的前提下，有效提升实时感知效率。该框架有望成为人机协作中多模态流式感知系统的可扩展系统性解决方案。

Abstract

In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework’s capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
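The two headline numbers in this abstract are a relative latency reduction and an activation recall. The sketch below shows how such quantities are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, the frame data are made up for illustration, and "activation recall" is assumed to mean the fraction of frames that needed a module on which the scheduler actually ran it.

```python
def activation_recall(needed, activated):
    """Fraction of frames needing a module on which it was actually run.

    ASSUMPTION: this reading of "activation recall" is ours; the abstract
    does not define the metric.
    """
    hits = [a for n, a in zip(needed, activated) if n]
    if not hits:
        return 0.0
    return sum(hits) / len(hits)


def latency_reduction_pct(baseline_ms, scheduled_ms):
    """Relative latency reduction in percent versus a baseline pipeline."""
    return 100.0 * (baseline_ms - scheduled_ms) / baseline_ms


# Hypothetical per-frame data, for illustration only.
needed = [True, True, False, True, False, True]      # module was required
activated = [True, False, False, True, True, True]   # scheduler ran the module

recall = activation_recall(needed, activated)
reduction = latency_reduction_pct(baseline_ms=100.0, scheduled_ms=80.0)
print(f"recall={recall:.2f} reduction={reduction:.1f}%")
```

Running a module on a frame that did not need it (the fifth frame here) costs latency but does not lower recall, which is why the two metrics are reported together.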

Keywords: multimodal streaming perception, human-robot collaboration, perception scheduling, computational latency reduction, real-time perception, scene understanding, lightweight framework, relevance-driven scheduling

151. ❌ Towards Faithful Multimodal Concept Bottleneck Models

Authors: Pierre Moreau, Emeline Pineau Ferrand, Yann Choho, Benjamin Wong, Annabelle Blangero, Milan Bhan Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

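Scoring rationale: The paper studies a faithful multimodal Concept Bottleneck Model (f-CBM), focusing on explainable AI and concept leakage. It is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points) because CBMs are a dedicated interpretable-model framework. It is moderately relevant to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points) because concept leakage can produce unfaithful explanations, a problem akin to factuality. It is weakly relevant to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because the paper uses a vision-language backbone and may involve foundation-model techniques. The other keywords (such as MoE, SFT, and RAG) are not mentioned in the abstract and are unrelated (0 points).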
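The table above pairs a 10/10 relevance with a score of 0.0, which is consistent with a weighted sum in which every weight is 0.0. The sketch below reproduces that arithmetic together with the pass mark from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The per-keyword rule `weight * relevance` and the function names are assumptions for illustration; the report does not state its exact scoring rule.

```python
def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report's configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows.

    ASSUMPTION: the report does not state the per-keyword rule; a plain
    product is used because it is consistent with the 0.0 scores shown.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Rows from the f-CBM table: three non-zero relevances, every weight 0.0.
fcbm_rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 10.0)] + [(0.0, 0.0)] * 24

threshold = pass_threshold(27 * 1.0)  # 27 keywords, weight 1.0 each
score = paper_score(fcbm_rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Under any multiplicative rule, a zero weight removes the keyword's contribution, so the relevance values in this table cannot move the paper toward the pass mark.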

!!! tip deepseek-chat TL;DR

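The paper targets concept leakage and concept detection in multimodal concept bottleneck models. It proposes the f-CBM framework, which combines a leakage loss with a KAN prediction head to achieve the best trade-off between task accuracy, concept detection, and leakage reduction.

Abstract translation (Chinese)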

概念瓶颈模型（Concept Bottleneck Models, CBMs）是一种可解释模型，其预测通过一层人类可理解的概念进行传递。尽管在视觉领域已得到广泛研究，并逐渐扩展至自然语言处理领域，但CBM在多模态场景中的应用仍鲜有探索。为确保解释的忠实性，CBM需满足两个条件：概念必须被准确检测，且概念表征应仅编码其预设语义，避免将额外的任务相关信息或概念间信息渗入最终预测——这种现象称为信息泄漏。现有方法通常将概念检测与泄漏缓解视为独立问题，且往往以牺牲预测准确性为代价来改进其中一方面。本研究提出f-CBM，这是一个基于视觉-语言骨干网络构建的忠实多模态CBM框架，通过两种互补策略协同处理上述两方面问题：采用可微泄漏损失函数以抑制信息泄漏，并引入柯尔莫哥洛夫-阿诺德网络预测头以增强表达能力从而提升概念检测性能。实验表明，f-CBM在任务准确性、概念检测与泄漏控制之间实现了最佳平衡，同时可无缝应用于图像-文本或纯文本数据集，展现出跨模态的通用性。

Abstract

Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.

Keywords: Concept Bottleneck Models, Multimodal, Interpretable Models, Leakage Mitigation, Faithful Explanations, Vision-Language Backbone, Kolmogorov-Arnold Network, Concept Detection

152. ❌ DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Authors: Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13162v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

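Scoring rationale: The paper focuses on image compression using a Diffusion Transformer (DiT) and is unrelated to most of the large language model (LLM) keywords. It is weakly relevant to only three keywords: (1) 'Pre-training' (5 points): the paper adapts a pretrained text-to-image DiT model; (2) 'Alignment' (5 points): the paper proposes three alignment mechanisms (variance-guided reconstruction flow, self-distillation alignment, latent-conditioned guidance); (3) 'Inference Acceleration' (5 points): the paper achieves up to 30x faster decoding. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper addresses the high compute cost and memory usage of diffusion-based image compression. It proposes DiT-IC, an aligned Diffusion Transformer that runs diffusion efficiently in a 32x-downscaled latent space, achieving state-of-the-art perceptual quality with up to 30x faster decoding and much lower memory usage.

Abstract translation (Chinese)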

基于扩散模型的图像压缩方法近期展现出卓越的感知保真度，但其实际应用受限于高昂的采样开销与内存占用。现有扩散编码器大多采用U-Net架构，其层级下采样机制迫使扩散过程在浅层潜在空间中进行（通常仅实现8倍空间下采样），导致计算量过大。相比之下，传统基于VAE的编码器可在更深层的潜在域工作（16倍至64倍下采样），这引出一个关键问题：扩散模型能否在此类紧凑潜在空间中有效运行，同时保持重建质量？为此，我们提出DiT-IC——一种用于图像压缩的对齐扩散变换器（Aligned Diffusion Transformer for Image Compression），该模型用扩散变换器（Diffusion Transformer）替代U-Net，能够在32倍下采样分辨率的潜在空间中完整执行扩散过程。DiT-IC通过三项关键对齐机制，将预训练的文本到图像多步DiT适配为单步重建模型：（1）方差引导重建流，根据潜在不确定性调整去噪强度以实现高效重建；（2）自蒸馏对齐，强制模型与编码器定义的潜在几何保持一致，实现一步扩散；（3）潜在条件引导机制，用语义对齐的潜在条件替代文本提示，实现无文本推理。凭借这些设计，DiT-IC在达到最先进感知质量的同时，解码速度比现有扩散编码器提升高达30倍，并大幅降低内存占用。值得注意的是，该模型可在16GB笔记本电脑GPU上重建2048x2048分辨率的图像。

Abstract

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
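To make the downscaling argument concrete, the sketch below counts latent spatial positions for a square image at different downscaling factors. It is plain arithmetic, not code from DiT-IC: the function name is hypothetical, and it assumes the factor applies to each spatial dimension and divides the side length evenly.

```python
def latent_positions(image_side: int, downscale: int) -> int:
    """Number of spatial positions in the latent grid of a square image.

    ASSUMPTION: `downscale` applies per spatial dimension and divides
    `image_side` evenly; channel depth is ignored.
    """
    if image_side % downscale != 0:
        raise ValueError("image_side must be divisible by downscale")
    side = image_side // downscale
    return side * side


shallow = latent_positions(2048, 8)   # 8x: typical U-Net diffusion codec
deep = latent_positions(2048, 32)     # 32x: resolution DiT-IC diffuses at
print(shallow, deep, shallow // deep)
```

For a 2048x2048 image this gives a 256x256 grid at 8x and a 64x64 grid at 32x, which is 16 times fewer positions. How that translates into compute depends on the architecture, so the ratio is only a rough indicator.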

Keywords: Diffusion Transformer, Image Compression, Latent Space, Efficient Decoding, Perceptual Quality, Single-step Reconstruction, Memory Usage Reduction, Alignment Mechanisms

153. ❌ FDeID-Toolbox: Face De-Identification Toolbox

Authors: Hui Wei, Hao Yu, Guoying Zhao Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

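Scoring rationale: The paper develops a face de-identification (FDeID) toolbox in computer vision, covering conventional computer vision concerns such as privacy protection, image processing, and standardized evaluation. Neither the title nor the abstract mentions large language models, deep learning principles, AI-for-science applications, or anything else related to the keywords. All scoring keywords concern LLM technology, deep learning innovation, or AI for science, whereas this paper implements a toolbox for a specific computer vision task and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper addresses fragmented implementations and inconsistent evaluation in face de-identification research. It presents FDeID-Toolbox, a comprehensive toolbox with standardized data loading, unified method implementations, flexible inference pipelines, and systematic evaluation protocols, enabling fair and reproducible comparison of methods under consistent conditions.

Abstract translation (Chinese)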

面部去身份识别（FDeID）旨在从面部图像中移除个人身份信息，同时保留与任务相关的实用属性，如年龄、性别和表情。这对于隐私保护的计算机视觉至关重要，然而该领域目前存在实现方案碎片化、评估协议不一致以及不同研究结果难以比较的问题。这些挑战源于任务固有的复杂性：FDeID涉及多个下游应用（例如年龄估计、性别识别、表情分析），并需要在三个维度（如隐私保护、效用保持和视觉质量）上进行评估，这使得现有代码库难以使用和扩展。为解决这些问题，我们提出了FDeID-Toolbox，一个为可复现的FDeID研究设计的综合性工具箱。该工具箱采用模块化架构，包含四个核心组件：（1）针对主流基准数据集的标准化数据加载器，（2）涵盖经典方法至最先进（SOTA）生成模型的统一方法实现，（3）灵活的推理流程，以及（4）覆盖隐私性、效用性和质量指标的系统化评估协议。通过实验，我们证明FDeID-Toolbox能够在一致条件下，对多种FDeID方法进行公平且可复现的比较。

Abstract

Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (e.g., privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.

Keywords: Face de-identification, Privacy-preserving computer vision, Toolbox, Reproducible research, Standardized evaluation, Generative models, Utility preservation, Visual quality

154. ❌ NOIR: Neural Operator mapping for Implicit Representations

Authors: Sidaty El Hadramy, Nazim Haouchine, Michael Wehrli, Philippe C. Cattin Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

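Scoring rationale: NOIR is a neural operator learning framework for medical imaging that embeds discrete signals into implicit neural representations and learns an operator mapping between them, applied to tasks such as segmentation and shape completion. The keywords concern large language models, alignment, reasoning, agents, optimization, and related techniques, none of which the paper addresses. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to biomedical imaging (a scientific domain); this is the application area rather than the core contribution, so it receives 5 points (moderately relevant). All other keywords receive 0 points (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes NOIR, a framework that recasts medical imaging tasks as operator learning between continuous function spaces. Using implicit neural representations, it achieves resolution-independent function-to-function mapping, reaching competitive performance on multiple 2D and 3D downstream tasks and showing strong robustness.

Abstract translation (Chinese)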

本文提出NOIR框架，该框架将核心医学影像任务重新定义为连续函数空间之间的算子学习，从而挑战了当前主流的基于离散网格的深度学习范式。与在固定像素或体素网格上操作不同，NOIR将离散医学信号嵌入共享的隐式神经表示，并学习一个在其潜在调制之间映射的神经算子，从而实现与分辨率无关的函数到函数变换。我们在多个公开数据集（如Shenzhen、OASIS-4、SkullBreak、fastMRI）以及内部临床数据集上，通过多种2D与3D下游任务（包括分割、形状补全、图像到图像转换和图像合成）评估NOIR。该框架在原始分辨率下取得了具有竞争力的性能，同时对未见过的离散化方案表现出强大的鲁棒性，并在经验上满足了神经算子的关键理论性质。项目页面详见：https://github.com/Sidaty1/NOIR-io。

Abstract

This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: https://github.com/Sidaty1/NOIR-io.

Keywords: Neural Operator, Implicit Neural Representations, Medical Imaging, Function-to-function Mapping, Resolution-independent, Segmentation, Image Synthesis, Operator Learning

155. ❌ Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

Authors: Guoqiang Zhao, Zhe Yang, Sheng Wu, Fei Teng, Mengfei Duan, Yuanfan Zheng, Kai Luo, Kailun Yang Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

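Scoring rationale: The paper focuses on panoramic multimodal semantic occupancy prediction for quadruped robots, which belongs to computer vision and robot perception. It introduces the PanoMMOcc dataset and the VoxelHound framework, with techniques including vertical jitter compensation and multimodal information prompt fusion. All scoring keywords concern large language models, deep learning principles, or AI-for-science applications, none of which the paper addresses, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

Existing occupancy prediction methods rely on RGB cues and lack robustness for quadruped robots in complex environments. The paper presents PanoMMOcc, the first real-world panoramic multimodal occupancy dataset, and VoxelHound, a framework designed for legged mobility. With vertical jitter compensation and multimodal information prompt fusion modules, it improves mIoU by 4.16%.

Abstract translation (Chinese)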

全景影像为四足机器人提供了完整的360°视觉覆盖感知能力。然而，现有的占据预测方法主要针对轮式自动驾驶设计，且严重依赖RGB视觉线索，在复杂环境中的鲁棒性受限。为弥补这一差距，(1) 我们提出了首个面向四足机器人的真实世界全景多模态占据数据集PanoMMOcc，涵盖多样化场景下的四种传感模态。(2) 我们提出专为足式移动与球面成像设计的全景多模态占据感知框架VoxelHound。具体而言，我们设计了(i) 垂直抖动补偿模块，以缓解移动过程中因机体俯仰和横滚导致的剧烈视点扰动，实现更一致的空间推理；以及(ii) 高效的多模态信息提示融合模块，联合利用全景视觉线索与辅助模态以增强体素占据预测能力。(3) 我们基于PanoMMOcc建立了基准测试体系，并提供详细数据分析，以支持在具身挑战场景下对感知方法进行系统评估。大量实验表明，VoxelHound在PanoMMOcc上实现了最先进的性能表现（mIoU提升+4.16%）。数据集与代码将通过https://github.com/SXDR/PanoMMOcc公开，同时发布于https://github.com/losehu/CameraLiDAR-Calib的标定工具也将一并发布，以促进具身机器人系统全景多模态三维感知领域的未来研究。

Abstract

Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
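The reported gain is in mean Intersection over Union (mIoU). As a reminder of how that metric is commonly computed, the sketch below averages per-class IoU over flattened voxel labels. It assumes one common convention (classes absent from both prediction and ground truth are skipped); the benchmark's exact protocol, class list, and any ignore label are not given in the abstract, and the function name is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length label sequences.

    ASSUMPTION: classes that appear in neither sequence are skipped
    rather than counted as IoU 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, three classes (0 = free, 1 and 2 = occupied types).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Whether absent classes are skipped or counted changes the result, which is one reason mIoU values are only comparable under the same protocol.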

Keywords: panoramic multimodal occupancy prediction, quadruped robots, PanoMMOcc dataset, VoxelHound framework, vertical jitter compensation, multimodal information prompt fusion, 3D perception, embodied robotic systems

156. ❌ BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

Authors: Matteo Ballegeer, Dries F. Benoit Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

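Scoring rationale: The paper uses deep learning to assess the manufacturability of CAD designs, an application of AI in engineering and manufacturing science. It is unrelated to the vast majority of keywords (LLM principles, training methods, inference optimization, alignment, agents, and so on). It is moderately relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because that keyword broadly covers AI applied to science (including engineering science), although the paper does not involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

For manufacturability assessment in sheet metal bending, the paper proposes a taxonomy and creates BenDFM, the first synthetic dataset containing both manufacturable and unmanufacturable parts, to study learning-based DFM challenges systematically. It also shows that graph-based 3D learning architectures achieve better prediction accuracy.

Abstract translation (Chinese)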

在设计可制造性（DFM）领域，早期预测CAD设计的可制造性（包括可行性与所需成本）是一个核心目标。尽管深度学习在CAD领域已取得进展，并在制造工艺选择中得到广泛应用，但针对特定工艺的可制造性预测，基于学习的方法仍较为有限。两大关键挑战制约了进展：先前研究对可制造性的定义不一致，导致学习目标各异；以及缺乏合适的数据集。现有标签差异显著：它们可能反映内在的设计约束，也可能依赖于特定的制造能力（如可用工具），且范围从离散的可行性检查到连续的复杂度度量不等。此外，工业数据集通常仅包含可制造零件，对不可行案例的参考信息极少；而现有的合成数据集则侧重于简单几何形状和减材工艺。为弥补这些不足，我们提出了一种沿配置依赖性和度量类型两个维度划分的可制造性度量分类法，从而更清晰地界定泛化范围与学习目标。接着，我们推出了BenDFM——首个用于钣金弯曲可制造性评估的合成数据集。BenDFM包含20,000个可制造与不可制造零件，通过工艺感知的弯曲仿真生成，同时提供折叠与展开的几何形状，以及涵盖分类法中多个维度的可制造性标签，从而能够系统研究先前未被探索的基于学习的DFM挑战。我们在BenDFM上对两种先进的3D学习架构进行了基准测试，结果表明：基于图的表征方法能够捕捉零件表面间的关系，从而获得更高的预测精度；而预测依赖于特定制造配置的度量指标则仍然更具挑战性。

Abstract

Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.

Keywords: manufacturability assessment, sheet metal bending, synthetic CAD dataset, Design for Manufacturing (DFM), deep learning, 3D learning architectures, graph-based representations, process-aware simulation

157. ❌ SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Authors: Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13098v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0
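
The arithmetic behind the score tables can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) is taken from the report configuration; the per-row rule `weight * relevance` is an assumption inferred from the table columns, and the function names are hypothetical. Note that every row in these tables lists a weight of 0.0, so under this assumption the total is 0.0 regardless of relevance.

```python
# Hypothetical reconstruction of the report's scoring arithmetic.
# The pass-mark formula comes from the report configuration; the per-row
# rule (weight * relevance) is an assumption inferred from the table.

def pass_mark(total_weight: float) -> float:
    """Pass mark as configured: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def row_score(weight: float, relevance: float) -> float:
    """Assumed per-keyword score: weight times relevance (0-10 scale)."""
    return weight * relevance

def total_score(rows) -> float:
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(row_score(w, r) for w, r in rows)

# The three rows of this entry with non-zero relevance, weights as listed.
rows_157 = [(0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]
score = total_score(rows_157)
threshold = round(pass_mark(27.0), 1)
print(score, threshold, score >= threshold)
```

With the configured weight of 1.0 per keyword, the same three rows would contribute 18.0 under this assumed rule, which is still below the pass mark.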

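Rationale: The paper's main contribution is SldprtNet, a large-scale multimodal dataset for CAD generation; it uses the lightweight multimodal language model Qwen2.5-VL-7B to generate natural language descriptions. Relevance to the keywords: 1) "Large Language Models" scores 5: the paper uses Qwen2.5-VL-7B, but this is not its core contribution; 2) "Post-training" scores 5: the paper fine-tunes baseline models on a subset of the dataset, which involves supervised fine-tuning; 3) "AI for Science" scores 8: the paper applies AI to engineering domains such as 3D design and CAD modeling, which falls under AI for Science; 4) all other keywords score 0: the paper proposes no technical innovation in MoE, quantization, inference acceleration, alignment, RAG, or similar areas.

!!! tip deepseek-chat TL;DR

The paper builds SldprtNet, a large-scale multimodal dataset for language-driven 3D CAD design, uses Qwen2.5-VL-7B to generate natural language descriptions, and shows that multimodal input benefits CAD generation.

Abstract (Chinese translation)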

我们推出SldprtNet，这是一个包含超过242,000个工业零件的大规模数据集，专为语义驱动的CAD建模、几何深度学习以及三维设计多模态模型的训练与微调而构建。该数据集提供.step和.sldprt两种格式的三维模型，以支持多样化的训练与测试需求。为实现参数化建模并促进数据集的可扩展性，我们开发了配套工具——编码器与解码器，支持13种CAD命令类型，并实现了三维模型与结构化文本表示之间的无损转换。此外，每个样本均配有一幅合成图像，该图像通过融合三维模型七个不同视角的渲染视图生成，有效缩短了输入标记长度并加速了推理过程。通过将此图像与编码器输出的参数化文本相结合，我们采用轻量级多模态语言模型Qwen2.5-VL-7B生成每个零件外观与功能的自然语言描述。为确保准确性，我们人工核验并对齐了生成的描述、渲染图像与三维模型。这些描述与参数化建模脚本、渲染图像及三维模型文件完全对齐，共同构建了SldprtNet。为评估其有效性，我们在数据集子集上对基线模型进行微调，比较了“图像加文本”输入与纯文本输入的差异。结果证实了多模态数据集对CAD生成的必要性与价值。本数据集具有以下特点：精选真实工业零件、提供支持可扩展数据集拓展的工具、包含多类数据模态，并确保模型复杂度与几何特征的多样性，从而构建了一个为语义驱动CAD建模与跨模态学习服务的综合性多模态数据集。

Abstract

We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part’s appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
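
The composite-image idea (merging seven rendered views into one image so the model receives a single image input) can be illustrated with a minimal sketch. This is not the authors' tooling: the grid layout, the white padding cell, and the function name `make_composite` are assumptions, and the views here are synthetic arrays rather than real renders.

```python
import numpy as np

def make_composite(views, cols=4):
    """Tile equally sized H x W x 3 views into a grid; unused cells stay white."""
    h, w, c = views[0].shape
    rows = -(-len(views) // cols)  # ceiling division
    canvas = np.full((rows * h, cols * w, c), 255, dtype=views[0].dtype)
    for i, view in enumerate(views):
        r, col = divmod(i, cols)
        canvas[r * h:(r + 1) * h, col * w:(col + 1) * w] = view
    return canvas

# Seven synthetic 64 x 64 "renders", each a flat gray level (stand-ins only).
views = [np.full((64, 64, 3), 30 * i, dtype=np.uint8) for i in range(7)]
composite = make_composite(views)
print(composite.shape)
```

One image of seven tiles costs a vision-language model one image's worth of tokens instead of seven, which is the token-length saving the abstract refers to.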

Keywords: CAD generation, multimodal dataset, 3D design, language-driven design, parametric modeling, geometric deep learning, Qwen2.5-VL-7B, industrial parts

158. ❌ Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Authors: Seunghwan Bang, Hwanjun Song Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

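Rationale: The paper studies how multimodal large language models perform on spatiotemporal video reasoning, which centers on applying large language models (LLMs) to embodied agents, so it is highly relevant to "Large Language Models" (10). It evaluates a model's ability to integrate dispersed cues and infer implicit structure, which maps directly to the multi-step and in-depth reasoning in "Chain of Thought" and "System 2 Thinking" (10 each). The motivation explicitly mentions "embodied agents", a strong match for the "LLM Agents" keyword (10). Other keywords such as MoE, quantization, and RAG are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper builds the VAEX-BENCH benchmark to evaluate abstractive spatiotemporal reasoning over video in multimodal large language models, and finds that current models are markedly limited on abstractive tasks that require integrating observations and inferring implicit structure.

Abstract (Chinese translation)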

对具身智能体日益增长的需求提升了对时空视频理解的要求，然而现有基准测试主要强调抽取式推理——其答案可直接呈现于时空事件中。目前尚不清楚多模态大语言模型是否能够执行抽象时空推理，这需要整合跨时间的观察、结合分散的线索，并推断隐式的空间与上下文结构。为填补这一空白，我们通过引入结构化评估分类法来形式化视频中的抽象时空推理，该方法系统性地针对其核心维度，并构建了一个可控的、场景驱动的合成第一人称视角视频数据集，专门用于评估抽象时空推理能力，涵盖物体级、房间级和平面图级场景。基于此框架，我们提出了VAEX-BENCH基准测试，包含五项抽象推理任务及其对应的抽取式任务版本。我们通过大量实验比较了前沿多模态大语言模型在抽取式与抽象式设定下的表现，揭示了它们在抽象任务上的局限性，并对潜在瓶颈进行了细粒度分析。该数据集即将公开发布。

Abstract

The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

Keywords: multimodal large language models, spatiotemporal reasoning, video understanding, abstractive reasoning, embodied agents, benchmark evaluation, MLLMs, VAEX-BENCH

159. ❌ V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Authors: Shenghe Zheng, Junpeng Jiang, Wenbo Li Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13089v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

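Rationale: The paper applies video generative models to image restoration, a computer vision topic unrelated to most keywords, especially the language-model ones. It has only moderate, non-core relevance to "Pre-training" (it uses pretrained video models) and "Post-training" (it fine-tunes on a small amount of data).

!!! tip deepseek-chat TL;DR

The paper proposes V-Bridge, a framework that uses the visual priors of pretrained video generative models and few-shot fine-tuning to perform multi-task image restoration, challenging the traditional boundary between generative modeling and low-level vision.

Abstract (Chinese translation)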

大规模视频生成模型通过海量多样化视觉数据进行训练，使其能够内化视觉世界中丰富的结构、语义与动态先验。尽管这些模型已展现出卓越的生成能力，但其作为通用视觉学习器的潜力仍未被充分挖掘。本研究提出V-Bridge框架，旨在将这种潜在能力桥接至多样化的少样本图像复原任务中。我们重新定义图像复原问题，将其不再视为静态回归任务，而是理解为渐进式生成过程，并利用视频模型模拟从退化输入到高保真输出的逐步优化路径。令人惊讶的是，仅使用1,000个多任务训练样本（不足现有复原方法数据量的2%），预训练视频模型即可被引导实现具有竞争力的图像复原性能，通过单一模型完成多项任务，其效果可与专门设计的专用架构相媲美。我们的研究揭示：视频生成模型隐式学习了强大且可迁移的复原先验，仅需极少量数据即可激活，这挑战了生成建模与底层视觉之间的传统界限，并为视觉任务基础模型开启了新的设计范式。

Abstract

Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.

Keywords: video generative models, image restoration, few-shot learning, visual priors, multi-task learning, foundation models, generative process, low-level vision

160. ❌ InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Authors: Yebin Yang, Di Wen, Lei Qi, Weitong Kong, Junwei Zheng, Ruiping Liu, Yufan Chen, Chengzhi Wu, Kailun Yang, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Kunyu Peng Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13082v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

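Rationale: The paper studies text-guided multi-person 3D motion editing using a diffusion model and a newly built dataset. It belongs to computer vision and graphics and is unrelated to all of the large-model and deep-learning technique keywords (LLMs, MoE, Scaling Laws, RLHF, and so on) as well as to application areas such as AI for Science.

!!! tip deepseek-chat TL;DR

The paper proposes InterEdit, a synchronized classifier-free conditional diffusion model for text-guided multi-person 3D motion editing. With a new dataset and benchmark, it addresses the interaction complexity and data scarcity of multi-person motion editing and reports state-of-the-art performance.

Abstract (Chinese translation)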

文本引导的三维动作编辑已在单人场景中取得成功，但其向多人场景的扩展因配对数据有限及人际交互的复杂性而较少被探索。我们提出了多人三维动作编辑任务，旨在根据源动作和文本指令生成目标动作。为此，我们构建了InterEdit3D——一个包含人工标注的双人动作变化信息的新数据集，并建立了文本引导多人动作编辑（TMME）基准。我们提出了InterEdit模型，这是一个用于TMME任务的同步无分类器条件扩散模型。该模型引入了语义感知规划令牌对齐机制，通过可学习令牌捕捉高层级交互线索；同时采用交互感知频率令牌对齐策略，利用离散余弦变换（DCT）与能量池化建模周期性动作动态。实验表明，InterEdit显著提升了文本-动作一致性与编辑保真度，在TMME任务中达到了最先进的性能。数据集与代码将在https://github.com/YNG916/InterEdit 公开。

Abstract

Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.
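
The abstract mentions DCT and energy pooling for modeling periodic motion. A minimal sketch of one way such frequency features could be computed is shown below; it is not the authors' implementation. The orthonormal DCT-II basis, the contiguous band split, and the names `dct2_matrix` and `frequency_energy_tokens` are all assumptions made for illustration.

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II basis of size n x n (rows index frequencies)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis

def frequency_energy_tokens(motion, num_bands=4):
    """motion: (T, D) array of frames. Returns (num_bands,) pooled energies."""
    coeffs = dct2_matrix(motion.shape[0]) @ motion   # DCT along the time axis
    energy = (coeffs ** 2).sum(axis=1)               # energy per frequency
    bands = np.array_split(energy, num_bands)        # contiguous frequency bands
    return np.array([band.sum() for band in bands])

# A slow periodic signal in two hypothetical joint channels.
T = 64
t = np.arange(T)
motion = np.cos(np.pi * (t + 0.5) * 3 / T)[:, None] * np.ones((1, 2))
tokens = frequency_energy_tokens(motion)
print(tokens)
```

Because the basis is orthonormal, the pooled energies sum to the total signal energy, so the band values can be read as how the motion's energy is distributed across slow and fast dynamics.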

Keywords: 3D motion editing, multi-person motion, text-guided editing, diffusion model, interaction modeling, motion dataset, semantic alignment, frequency token alignment

161. ❌ Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Authors: Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13085v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

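Rationale: The paper studies the theoretical properties of linearized attention, in particular its relationship to the Neural Tangent Kernel (NTK) framework. It is mainly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention", because it directly analyzes linearized attention, which is central to that keyword. The other keywords concern applications, training, alignment, inference optimization, and agent systems for large models, whereas this paper is a theoretical analysis of attention and is unrelated to those application-oriented keywords.

!!! tip deepseek-chat TL;DR

The paper shows that, under the Neural Tangent Kernel framework, linearized attention does not converge to its infinite-width limit, and introduces the notion of influence malleability, indicating that both the power and the vulnerability of attention stem from its departure from the kernel regime.

Abstract (Chinese translation)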

理解注意力机制的理论基础因其复杂的非线性动力学特性而始终充满挑战。本研究揭示了线性化注意力学习动态中的一个基本权衡。通过采用与数据依赖的格拉姆诱导核具有精确对应关系的线性化注意力机制，基于神经正切核（NTK）框架的实证与理论分析表明，即使在大宽度条件下，线性化注意力也不会收敛到其无限宽度的NTK极限。一项谱放大结果对此进行了形式化证明：注意力变换将格拉姆矩阵的条件数立方化，要求宽度 $m = \Omega(\kappa^6)$ 才能实现收敛，这一阈值超过了自然图像数据集上任何实际可行的宽度。这种非收敛性通过影响可塑性——即动态改变对训练样本依赖程度的能力——得以表征。注意力的可塑性比ReLU网络高出6–9倍，这具有双重含义：其数据依赖的核可通过与任务结构对齐来降低近似误差，但同样的敏感性也增加了其受训练数据对抗性操纵的脆弱性。这些发现表明，注意力的强大能力与脆弱性具有共同的根源，即其偏离了核机制。

Abstract

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix’s condition number, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6–9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention’s power and vulnerability share a common origin in its departure from the kernel regime.
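
To make the "cubed condition number" statement concrete, the sketch below checks numerically that cubing a symmetric positive definite Gram matrix cubes its condition number. Treating the attention transformation as acting spectrally like $G \mapsto G^3$ is an assumption made purely for illustration; the random data and the helper name `condition_number_spd` are hypothetical.

```python
import numpy as np

def condition_number_spd(M):
    """Largest over smallest eigenvalue of a symmetric positive definite matrix."""
    eig = np.linalg.eigvalsh(M)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # hypothetical data: 5 samples, 8 features
G = X @ X.T                      # 5 x 5 Gram matrix, full rank almost surely

kappa = condition_number_spd(G)
kappa_cubed = condition_number_spd(G @ G @ G)
print(kappa, kappa_cubed, kappa ** 3)

# Width scale implied by m = Omega(kappa^6), up to an unknown constant.
print(kappa ** 6)
```

Even a modest condition number grows quickly under a sixth power, which is consistent with the abstract's remark that the required width exceeds practical values on natural image data.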

Keywords: linearized attention, Neural Tangent Kernel, NTK, influence malleability, Gram matrix, non-convergence, attention mechanisms, theoretical analysis

162. ❌ Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Authors: Yunzhuo Chen, Jordan Vice, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

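Rationale: The paper studies memorization in text-to-image diffusion models and proposes two methods, RAPTA and ADMCD. All of the keywords relate to large language models (LLMs), whereas this paper focuses on diffusion models, a different class of generative model, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To address the problem that text-to-image diffusion models may memorize and reproduce training images, the paper proposes region-aware prompt augmentation and attention-driven multimodal copy detection, which reduce overfitting and reliably detect copying.

Abstract (Chinese translation)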

当前最先进的文生图扩散模型能够生成令人印象深刻的视觉效果，但可能记忆并复现训练图像，从而引发版权与隐私风险。现有在推理阶段应用的提示扰动方法（如随机词元插入或嵌入噪声）虽可能降低复制行为，却常损害图像与提示的对齐度及整体保真度。为解决这一问题，我们提出两种互补方法。首先，区域感知提示增强（Region-Aware Prompt Augmentation, RAPTA）利用目标检测器识别显著区域，并将其转化为语义基础的提示变体，在训练过程中随机采样以增加多样性，同时保持语义对齐。其次，注意力驱动的多模态复制检测（Attention-Driven Multimodal Copy Detection, ADMCD）通过轻量级Transformer聚合局部图像块、全局语义与纹理特征，生成融合表征，并应用简单的阈值决策规则来检测复制行为，无需依赖大规模标注数据集进行训练。实验表明，RAPTA在保持高合成质量的同时有效减少了过拟合，而ADMCD能可靠地检测复制行为，其性能优于单模态度量方法。

Abstract

State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.
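
The "fused representation plus thresholded decision rule" described for ADMCD can be sketched in a few lines. This is only an illustration of the general pattern: the paper fuses cues with a lightweight transformer, whereas the stand-in below simply normalizes and concatenates them, and the cosine measure, the threshold of 0.9, and all function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse(patch, semantic, texture):
    """Stand-in for learned fusion: L2-normalize each cue, then concatenate."""
    fused = []
    for cue in (patch, semantic, texture):
        length = math.sqrt(sum(x * x for x in cue)) or 1.0
        fused.extend(x / length for x in cue)
    return fused

def is_copy(generated, reference, threshold=0.9):
    """Flag a copy when the fused representations are at least `threshold` similar."""
    return cosine(generated, reference) >= threshold

# Hypothetical cue vectors for a training image, a near duplicate, and a novel image.
train = fuse([1.0, 0.0], [0.2, 0.9], [0.5, 0.5])
near_dup = fuse([0.9, 0.1], [0.2, 0.9], [0.5, 0.5])
novel = fuse([0.0, 1.0], [0.9, -0.2], [0.5, -0.5])
print(is_copy(near_dup, train), is_copy(novel, train))
```

A fixed threshold needs no labeled training set, which matches the abstract's point about detecting copying without large annotated datasets; in practice the threshold would have to be calibrated on held-out pairs.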

Keywords: text-to-image diffusion models, memorization mitigation, Region-Aware Prompt Augmentation, multimodal copy detection, copyright and privacy risks, semantic alignment, overfitting reduction, synthesis quality

163. ❌ Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

Authors: Yihang Zhou, Chao Lin, Hideki Kikumoto, Ryozo Ooka, Sibo Cheng Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13077v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

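Rationale: The paper focuses on reconstructing rooftop wind fields with deep learning models (UNet, Vision Transformer Autoencoder, Conditional Wasserstein GAN). This is an application of AI to science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5). However, it does not cover large language models (LLMs), MoE, small language models, scaling laws, pre-training, post-training, instruction tuning, RLHF, PEFT, RAG, context window extension, KV cache compression, chain of thought, System 2 thinking, MCTS, self-correction, LLM agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, explainable AI, world models, model merging, or in-context learning, so those keywords all score 0.

!!! tip deepseek-chat TL;DR

The study develops a deep-learning-based learning-from-observation framework for reconstructing rooftop wind fields from sparse sensor data. Deep learning methods outperform Kriging interpolation, and mixed wind-direction training and sensor placement optimization further improve performance and robustness.

Abstract (Chinese translation)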

实时屋顶风速分布对于无人机与城市空中交通系统的安全运行、风控系统及屋顶空间利用至关重要。然而屋顶流场具有强非线性、分离及跨方向变异等特征，使得基于稀疏传感器的流场重建面临挑战。本研究基于粒子图像测速（PIV）风洞实验数据，构建了观测学习框架，并对比了克里金插值法与三种深度学习模型：UNet、视觉Transformer自编码器（ViTAE）以及条件Wasserstein生成对抗网络（CWGAN）。我们评估了单风向训练（SDT）与混合风向训练（MDT）两种策略在5至30个传感器密度下的表现，测试了传感器位置±1网格扰动下的鲁棒性，并通过结合本征正交分解与QR分解的算法优化了传感器布局。结果表明，深度学习方法能有效基于稀疏传感器数据重建屋顶风场。与克里金插值相比，深度学习模型将结构相似性指数（SSIM）提升达32.7%，两倍因子符合率（FAC2）提升24.2%，归一化均方误差（NMSE）降低27.8%。混合风向训练进一步提升了性能，较单风向训练在SSIM上增益最高达173.7%，FAC2提升16.7%，几何平均偏差（MG）改善98.3%。研究还表明，为实现可靠部署，需协同考虑传感器配置、优化与训练策略。基于QR分解的优化使传感器扰动下的鲁棒性最高提升27.8%，但具体指标间存在权衡。采用实验数据而非模拟数据进行训练，可为不同场景下的方法选择与传感器布置提供实践指导。

Abstract

Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
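
Three of the reported metrics (FAC2, NMSE, MG) have common definitions in the flow and dispersion model evaluation literature, sketched below. The paper may use variants, so the formulas here are an assumption about the standard forms, and the sample values are made up. MG requires strictly positive values, which holds for wind-speed magnitudes.

```python
import math

def fac2(obs, pred):
    """Fraction of predictions within a factor of two of the observation."""
    hits = sum(1 for o, p in zip(obs, pred) if 0.5 <= p / o <= 2.0)
    return hits / len(obs)

def nmse(obs, pred):
    """Mean squared error normalized by the product of the two means."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return mse / (mean_o * mean_p)

def mg(obs, pred):
    """Geometric mean bias: exp(mean(ln obs) - mean(ln pred)); ideal value 1."""
    mean_ln_o = sum(math.log(o) for o in obs) / len(obs)
    mean_ln_p = sum(math.log(p) for p in pred) / len(pred)
    return math.exp(mean_ln_o - mean_ln_p)

# Hypothetical wind speeds (m/s) at three locations.
observed = [1.0, 2.0, 4.0]
predicted = [1.0, 5.0, 4.0]
print(fac2(observed, predicted), nmse(observed, predicted), mg(observed, predicted))
```

FAC2 and MG are ratio-based and therefore tolerant of scale, while NMSE penalizes large absolute errors, which is one reason improvements reported across metrics need not move together.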

Keywords: wind field reconstruction, sparse sensors, deep learning, UNet, Vision Transformer Autoencoder, Conditional Wasserstein GAN, sensor placement optimization, rooftop wind flow

164. ❌ Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

Authors: Ann Dooms Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

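Scoring rationale: This paper studies the mathematical foundations of diffusion models (DDIM in particular), interpreting them as a Partitioned Iterated Function System (PIFS) and analyzing its geometric properties to derive design criteria. Its subject is the mathematical theory, geometric analysis, and design optimization of diffusion models; it does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or applications of AI to science. All keywords target LLMs and related techniques (MoE, alignment, reasoning, agents, and so on) or AI applications in specific scientific fields, and are entirely unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The paper shows that the deterministic DDIM reverse chain has the mathematical structure of a Partitioned Iterated Function System (PIFS) and, on that basis, derives geometric quantities that characterize the denoising dynamics, providing a theoretical foundation and optimization criteria for diffusion model design.

Abstract translation (Chinese)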

当扩散模型将噪声转化为照片时，它实际上在做什么？
我们证明，确定性DDIM反向链作为一个分区迭代函数系统（PIFS）运行，且该框架可作为去噪扩散模型调度、架构和训练目标的统一设计语言。从PIFS结构中，我们推导出三个可计算的几何量：每步收缩阈值 $L^*_t$、对角扩张函数 $f_t(λ)$ 和全局扩张阈值 $λ^{**}$。这些量无需模型评估即可计算，并完整刻画了去噪动态过程。它们从结构上解释了扩散模型的双阶段行为：在高噪声下通过跨图像块的弥散注意力进行全局上下文整合，而在低噪声下则通过严格方差顺序逐块释放抑制来合成精细细节。自注意力机制自然成为PIFS收缩的基本单元。PIFS吸引子的Kaplan-Yorke维度通过李雅普诺夫谱上的离散Moran方程解析确定。

通过对PIFS分形几何的研究，我们推导出三个最优设计准则，并证明四种重要的经验设计选择（余弦调度偏移、分辨率依赖的对数信噪比偏移、Min-SNR损失加权以及Align Your Steps采样方法）均作为我们显式几何优化问题的近似解出现，从而将理论转化为实践。

Abstract

What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitioned Iterated Function System (PIFS) and that this framework serves as a unified design language for denoising diffusion model schedules, architectures, and training objectives. From the PIFS structure we derive three computable geometric quantities: a per-step contraction threshold $L^*_t$, a diagonal expansion function $f_t(λ)$ and a global expansion threshold $λ^{**}$. These quantities require no model evaluation and fully characterize the denoising dynamics. They structurally explain the two-regime behavior of diffusion models: global context assembly at high noise via diffuse cross-patch attention and fine-detail synthesis at low noise via patch-by-patch suppression release in strict variance order. Self-attention emerges as the natural primitive for PIFS contraction. The Kaplan-Yorke dimension of the PIFS attractor is determined analytically through a discrete Moran equation on the Lyapunov spectrum. Through the study of the fractal geometry of the PIFS, we derive three optimal design criteria and show that four prominent empirical design choices (the cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling) each arise as approximate solutions to our explicit geometric optimization problems, turning theory into practice.

Keywords: diffusion models, DDIM, Partitioned Iterated Function Systems, PIFS, denoising dynamics, geometric analysis, design criteria, fractal geometry
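The abstract refers to the Kaplan-Yorke dimension of an attractor, computed from a Lyapunov spectrum. As a reminder of what that quantity is, the sketch below implements the standard textbook definition: sort the exponents in descending order, find the largest j whose partial sum is non-negative, and add the fractional term sum / |λ_{j+1}|. This is not the paper's Moran-equation derivation, and the function name and example spectrum are illustrative assumptions.

```python
def kaplan_yorke_dimension(exponents):
    """Standard Kaplan-Yorke (Lyapunov) dimension of a Lyapunov spectrum.

    Illustrative sketch of the textbook definition, not the paper's method.
    """
    spectrum = sorted(exponents, reverse=True)
    partial_sum = 0.0
    for j, value in enumerate(spectrum):
        if partial_sum + value < 0:
            # The first j exponents have a non-negative sum; interpolate.
            return j + partial_sum / abs(value)
        partial_sum += value
    # The sum never turns negative: the dimension saturates at the full count.
    return float(len(spectrum))


# Example spectrum often quoted for the Lorenz system (values are approximate).
lorenz_like = [0.906, 0.0, -14.572]
dimension = kaplan_yorke_dimension(lorenz_like)
```

For the example spectrum the result is slightly above 2, roughly 2.06, which matches the familiar picture of a strange attractor that is a little "thicker" than a surface.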

165. ❌ Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Authors: Yuki Hirakawa, Takashi Wada, Ryotaro Shimizu, Takuya Furusawa, Yuki Saito, Ryosuke Araki, Tianwei Chen, Fan Mo, Yoshimitsu Aoki Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

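Scoring rationale: This paper focuses on image quality assessment for Virtual Try-On (VTON), proposing the reference-free evaluation framework VTON-IQA and building the large-scale human-annotated dataset VTON-QBench. Its core techniques are computer vision and the evaluation of generated images, using a Transformer architecture (in particular a cross-attention module) for quality prediction. The scoring keywords all concern large language models, the principles of deep learning techniques, AI for Science, and similar topics, none of which the paper touches. It studies quality assessment for a specific computer vision application (virtual try-on) and neither uses nor studies large models, innovations in deep learning principles, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the lack of a reliable evaluation standard for the outputs of virtual try-on systems, the paper proposes VTON-IQA, a reference-free image quality assessment framework that achieves quality prediction consistent with human perception by building a large-scale human-annotated dataset and introducing a cross-attention module.

Abstract translation (Chinese)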

给定一张人物图像和一件服装图像，基于图像的虚拟试穿系统会合成该人物穿着目标服装的试穿图像。随着虚拟试穿系统在时尚电商等实际应用中日益重要，对其输出结果进行可靠评估已成为关键挑战。在实际场景中，通常无法获得同一人物穿着目标服装的真实图像，这使得基于参考图像的评估方法难以实施。此外，广泛使用的分布级指标（如Fréchet Inception Distance和Kernel Inception Distance）衡量的是数据集层面的相似性，无法反映单个生成图像的感知质量。为应对这些局限，我们提出了虚拟试穿图像质量评估框架，这是一种无需参考图像、对齐人类感知的图像级质量评估方法，且不依赖真实图像。为建模人类感知判断，我们构建了VTON-QBench大规模人工标注基准数据集，其中包含14个代表性虚拟试穿模型生成的62,688张试穿图像，以及从13,838名合格标注者收集的431,800条质量标注。据我们所知，这是目前虚拟试穿领域规模最大的人类主观评估数据集。评估虚拟试穿质量需要同时验证服装保真度和人物细节保留度。为显式建模这种交互关系，我们提出了交错交叉注意力模块，该模块通过在标准Transformer块的后段自注意力层与MLP之间插入交叉注意力层来扩展原有结构。大量实验表明，VTON-IQA能够实现可靠的对齐人类感知的图像级质量预测。此外，我们使用VTON-IQA对14个代表性虚拟试穿模型进行了全面的基准评估。

Abstract

Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.

Keywords: Virtual Try-On, Image Quality Assessment, Reference-free Evaluation, Human Feedback, VTON-IQA, Cross-Attention, Benchmark Dataset, Perceptual Quality

166. ❌ Multimodal OCR: Parse Anything from Documents

Authors: Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

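Scoring rationale: The paper proposes a multimodal OCR method (MOCR) focused on document parsing, which parses text and graphics jointly into a structured representation. Although it involves training a deep learning model (3B parameters) with pre-training and supervised fine-tuning, its core content is neither large language model (LLM) technology nor scientific AI applications. It mainly concerns computer vision and document understanding and is unrelated to most keywords (LLM, MoE, SLM, alignment, reasoning, agents, and so on). Only 'Pre-training' and 'SFT' are moderately related (5 points), because the paper mentions staged pre-training and supervised fine-tuning. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a multimodal OCR method (dots.mocr) that parses the text and graphics in documents into a unified structured representation; its 3B-parameter model achieves state-of-the-art performance on document parsing and structured graphics parsing tasks.

Abstract translation (Chinese)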

我们提出多模态光学字符识别（Multimodal OCR，简称MOCR），这是一种将文本与图形联合解析为统一文本表示的文档解析范式。与传统OCR系统仅专注于文本识别并将图形区域裁剪为像素块不同，我们的方法（命名为dots.mocr）将图表、示意图、表格和图标等视觉元素视为一级解析目标，使系统能够在解析文档的同时保持元素间的语义关联。该方法具有以下优势：（1）将文本和图形重构为结构化输出，实现更精准的文档重建；（2）支持跨异构文档元素的端到端训练，使模型能够利用文本与视觉组件间的语义关系；（3）将以往被丢弃的图形转化为可复用的代码级监督信号，释放现有文档中嵌入的多模态监督信息。为实现该范式的大规模应用，我们基于PDF文件、渲染网页和原生可缩放矢量图形（SVG）资源构建了完整的数据引擎，并通过分阶段预训练与监督微调训练了一个紧凑的30亿参数模型。我们从文档解析和结构化图形解析两个维度评估dots.mocr的性能：在文档解析基准测试中，该方法在我们的OCR竞技场Elo排行榜上仅次于Gemini 3 Pro，超越现有开源文档解析系统，并在olmOCR基准测试中以83.9分创下最新记录；在结构化图形解析方面，dots.mocr在图像转SVG的各项基准测试中均比Gemini 3 Pro实现更高的重建质量，在图表、用户界面布局、科学图示和化学结构式解析任务中展现出强劲性能。这些结果表明了构建大规模图像到代码语料库以支持多模态预训练的可扩展路径。代码与模型已公开于https://github.com/rednote-hilab/dots.mocr。

Abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.

Keywords: Multimodal OCR, document parsing, structured graphics parsing, pretraining, supervised fine-tuning, image-to-code, 3B-parameter model, end-to-end training

167. ❌ A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Authors: Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

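Scoring rationale: This paper studies a debiasing method for Vision-Language Models (VLMs), which belongs to the multimodal model field. Although VLMs are technically related to large language models (LLMs) (for example, both build on the Transformer architecture), the paper's core content (debiasing, fairness, the cross-modal space, a closed-form solution) has no direct connection to any of the provided keywords, which mainly target techniques, training methods, inference optimization, agent systems, and scientific AI applications for text-only language models. The keyword list contains no related terms such as "Vision-Language Models", "Debiasing", "Fairness", or "Multimodal", so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a debiasing method for vision-language models that uses a closed-form solution in the cross-modal space to achieve Pareto-optimal fairness with bounded utility loss; it requires no training or annotated data, and its effectiveness is validated on several downstream tasks.

Abstract translation (Chinese)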

尽管视觉-语言模型（VLMs）在多种下游任务中取得了显著性能，但近期研究表明，它们可能从训练数据中继承社会偏见，并进一步将其传播至下游应用。为解决这一问题，已有多种去偏方法被提出，然而大多数方法旨在提升公平性，却缺乏理论保证模型效用得以保持。本文提出一种在跨模态空间中具有闭式解的去偏方法，能够以有界的效用损失实现帕累托最优的公平性。该方法无需训练、无需标注数据，并可针对下游任务联合消除视觉与文本模态中的偏见。大量实验表明，在零样本图像分类、文本到图像检索以及文本到图像生成等下游任务中，我们的方法在群体公平性和交叉公平性方面，于多种公平性指标和数据集上均优于现有去偏方法，同时保持了任务性能。

Abstract

While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

Keywords: Vision-Language Models, Debiasing, Fairness, Cross-modal, Closed-form solution, Utility guarantees, Training-free, Intersectional fairness

168. ❌ Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning

Authors: Yamin Arefeen, Sidharth Kumar, Steven Warach, Hamidreza Saber, Jonathan Tamir Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

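Scoring rationale: The paper studies the application of Diffusion Probabilistic Models (DPMs) to accelerated MRI reconstruction, which falls under AI for Science (medical imaging) and is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Methodologically it uses large-scale pre-training and target-specific fine-tuning, which is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points). The paper mentions the 'foundation model paradigm', which has some connection to 'Large Language Models OR LLMs OR Foundation Models' (5 points). The other keywords mainly concern specific LLM techniques (MoE, RLHF, RAG, and so on) or reasoning methods (CoT, agents) and are unrelated to this paper's diffusion models and medical imaging application, so they all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a data-efficient accelerated MRI reconstruction method based on diffusion probabilistic models: with large-scale pre-training and fine-tuning on a small amount of target data, it achieves image quality comparable to the standard of care in clinical stroke MRI while substantially reducing the need for large domain-specific datasets.

Abstract translation (Chinese)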

目的：针对临床卒中磁共振成像中仅能获取有限全采样数据样本的情况，开发一种基于扩散概率生成模型的数据高效加速磁共振成像重建策略，以缩短扫描时间。

方法：受基础模型范式启发，我们提出一种简单的训练策略：首先在fastMRI的大规模多样化脑部磁共振成像公开数据集上预训练扩散概率生成模型，随后使用经精细调整的学习率与微调时长，在目标应用的小规模数据集上进行微调。该方法通过在受控的fastMRI实验以及临床卒中磁共振成像数据上进行评估，后者包含一项盲法临床阅片研究。

结果：在约4000例非FLAIR（液体衰减反转恢复序列）对比度的受试者数据上预训练，并仅用20例目标受试者的FLAIR数据微调的扩散概率生成模型，在多种加速因子下取得的重建性能与使用大量目标领域FLAIR数据训练的模型相当。实验表明，采用降低学习率的中等程度微调可提升性能，而微调不足或过度则会降低重建质量。在临床卒中磁共振成像应用中，一项由两名神经放射科医师参与的盲法阅片研究显示，使用所提方法从$2 \times$加速数据重建的图像，在图像质量与结构轮廓显示方面均不劣于临床标准图像。

结论：大规模预训练结合针对性微调，使得基于扩散概率生成模型的磁共振成像重建能够在数据受限的加速临床卒中扫描中实现。所提方法显著减少了对大型应用特异性数据集的需求，同时保持了临床可接受的图像质量，这支持了在目标应用中采用受基础模型启发的扩散模型进行加速磁共振成像。

Abstract

Purpose: To develop a data-efficient strategy for accelerated MRI reconstruction with Diffusion Probabilistic Generative Models (DPMs) that enables faster scan times in clinical stroke MRI when only limited fully-sampled data samples are available. Methods: Our simple training strategy, inspired by the foundation model paradigm, first trains a DPM on a large, diverse collection of publicly available brain MRI data in fastMRI and then fine-tunes on a small dataset from the target application using carefully selected learning rates and fine-tuning durations. The approach is evaluated on controlled fastMRI experiments and on clinical stroke MRI data with a blinded clinical reader study. Results: DPMs pre-trained on approximately 4000 subjects with non-FLAIR contrasts and fine-tuned on FLAIR data from only 20 target subjects achieve reconstruction performance comparable to models trained with substantially more target-domain FLAIR data across multiple acceleration factors. Experiments reveal that moderate fine-tuning with a reduced learning rate yields improved performance, while insufficient or excessive fine-tuning degrades reconstruction quality. When applied to clinical stroke MRI, a blinded reader study involving two neuroradiologists indicates that images reconstructed using the proposed approach from $2 \times$ accelerated data are non-inferior to standard-of-care in terms of image quality and structural delineation. Conclusion: Large-scale pre-training combined with targeted fine-tuning enables DPM-based MRI reconstruction in data-constrained, accelerated clinical stroke MRI. The proposed approach substantially reduces the need for large application-specific datasets while maintaining clinically acceptable image quality, supporting the use of foundation-inspired diffusion models for accelerated MRI in targeted applications.

Keywords: Diffusion Probabilistic Models, MRI reconstruction, pre-training, fine-tuning, stroke MRI, accelerated MRI, data-efficient, foundation model paradigm

169. ❌ Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis

Authors: Chen Feng, Zhuo Zhi, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, Ioannis Patras Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

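Scoring rationale: This paper studies noise-correction methods in Learning with Noisy Labels (LNL), which belongs to traditional machine learning, and focuses on the noise transition matrix, classifier training, and theoretical analysis. Its content does not involve large models, the principles of deep learning techniques, large-model applications, or any of the techniques in the scoring keywords (LLM, MoE, RLHF, RAG, and so on). All keywords concern large models, deep learning, and their applications, whereas this paper is a theoretical analysis in traditional machine learning, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

Through experiments and theoretical analysis, the paper shows that noise-correction methods for learning with noisy labels still fail even when given an ideal noise transition matrix, indicating that the root cause is not matrix estimation error but deeper optimization dynamics and information-theoretic limits.

Abstract translation (Chinese)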

基于噪声转移矩阵($T$)的统计一致性方法为带噪标签学习提供了理论依据，其能够保证收敛至最优的干净数据分类器。然而在实际应用中，这些方法的表现往往逊于样本选择等经验性方法，这一差距通常被归因于准确估计$T$的困难。学界普遍假设，若给定一个完美的$T$，噪声校正方法理应恢复其理论优势。本研究对这一长期存在的假设进行了决定性检验。我们在理想化条件下开展实验，为校正方法提供完美且先验的转移矩阵。即使在如此理想条件下，我们仍观察到这些方法在训练过程中会出现性能崩溃。这一结果有力地证明，其失效本质上并非$T$估计问题，而是源于更深层的缺陷。为解释该现象，我们提出了一个统一的分析框架，将宏观收敛状态、微观优化动力学以及从带噪标签中可学习信息的信息论极限这三个层面联系起来。这些结果共同从形式上阐释了理想噪声校正失败的原因，并为设计更可靠的带噪标签学习方法提供了具体指导。

Abstract

Statistically consistent methods based on the noise transition matrix ($T$) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating $T$. The common assumption is that, given a perfect $T$, noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a $T$-estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.

Keywords: Learning with Noisy Labels, noise transition matrix, noise correction, theoretical analysis, optimization dynamics, information-theoretic limits, performance collapse
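To make the role of the noise transition matrix $T$ concrete, the sketch below shows forward loss correction, one widely used $T$-based technique: the classifier's clean-class probabilities are pushed through $T$ to obtain noisy-class probabilities, and the loss is computed against the observed noisy label. The abstract does not say which correction methods the paper evaluates, so treat this only as an example of the family; the function names and numbers are hypothetical.

```python
import math


def forward_corrected_probs(clean_probs, T):
    """Map clean-class probabilities to noisy-class probabilities.

    T[i][j] is assumed to be P(noisy label = j | clean label = i).
    """
    k = len(clean_probs)
    return [sum(clean_probs[i] * T[i][j] for i in range(k)) for j in range(k)]


def forward_corrected_nll(clean_probs, T, noisy_label):
    """Negative log-likelihood of the observed noisy label after correction."""
    return -math.log(forward_corrected_probs(clean_probs, T)[noisy_label])


# Toy two-class example with a made-up oracle transition matrix.
T = [
    [0.8, 0.2],  # a clean class-0 sample keeps its label 80% of the time
    [0.3, 0.7],  # a clean class-1 sample is flipped to class 0 30% of the time
]
model_output = [0.9, 0.1]  # classifier's belief over the clean classes
noisy_probs = forward_corrected_probs(model_output, T)
loss = forward_corrected_nll(model_output, T, noisy_label=0)
```

Here the corrected distribution over noisy labels is 0.75 for class 0 and 0.25 for class 1, so a confident clean prediction is still only moderately confident about the noisy label.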

170. ❌ SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated learning

Authors: Md Anwar Hossen, Nathan R. Tallent, Luanzheng Guo, Ali Jannesary Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12976v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

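Scoring rationale: The paper focuses on a coreset selection method for federated learning (the SCOPE framework), addressing class imbalance and communication efficiency in scientific data. The keywords all relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and so on, none of which the paper covers. The only slightly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper mentions scientific discovery and scientific data, but it does not specifically address bioinformatics or cheminformatics, hence 5 points (some relevance). The other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes SCOPE, a coreset selection framework for federated learning that uses semantic analysis and orthogonal projection embeddings to filter anomalous and redundant data, addressing class imbalance in scientific data, and reports substantial communication-efficiency gains and computational speedups in experiments.

Abstract translation (Chinese)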

科学发现日益依赖于对联邦数据集的学习，这些数据集由高分辨率仪器产生的数据流所供给，且存在极端的类别不平衡问题。当前的机器学习方法要么需要不切实际的数据集中聚合，要么因类别不平衡而失效。现有的核心集选择方法依赖于局部启发式策略，使其无法感知全局数据分布，容易导致次优且缺乏代表性的剪枝。为克服这些挑战，我们提出SCOPE（基于正交投影嵌入的语义核心集联邦学习框架），这是一个面向联邦数据的核心集框架，能够过滤异常数据并自适应地剪枝冗余数据，以缓解长尾分布偏差。通过分析潜在空间分布，我们使用三个指标对每个数据点进行评分：衡量核心类别特征可靠性的表示分数、量化正交残差新颖性的多样性分数，以及指示与竞争类别相似度的边界邻近分数。与先前方法不同，SCOPE仅向联邦服务器共享标量指标以构建全局共识，从而确保通信效率。在全局共识的指导下，SCOPE动态过滤局部噪声并丢弃冗余样本，以抵消全局特征偏差。大量实验表明，SCOPE在实现具有竞争力的全局精度和稳健收敛性的同时，取得了卓越的效率：上行链路带宽降低128至512倍，实际运行时间加速7.72倍，并减少了本地核心集选择所需的浮点运算量和显存占用。

Abstract

Scientific discovery increasingly requires learning on federated datasets, fed by streams from high-resolution instruments, that have extreme class imbalance. Current ML approaches either require impractical data aggregation or fail due to class imbalance. Existing coreset selection methods rely on local heuristics, making them unaware of the global data landscape and prone to sub-optimal and non-representative pruning. To overcome these challenges, we introduce SCOPE (Semantic Coreset using Orthogonal Projection Embeddings for Federated learning), a coreset framework for federated data that filters anomalies and adaptively prunes redundant data to mitigate long-tail skew. By analyzing the latent space distribution, we score each data point using a representation score that measures the reliability of core class features, a diversity score that quantifies the novelty of orthogonal residuals, and a boundary proximity score that indicates similarity to competing classes. Unlike prior methods, SCOPE shares only scalar metrics with a federated server to construct a global consensus, ensuring communication efficiency. Guided by the global consensus, SCOPE dynamically filters local noise and discards redundant samples to counteract global feature skews. Extensive experiments demonstrate that SCOPE yields competitive global accuracy and robust convergence, all while achieving exceptional efficiency with a 128x to 512x reduction in uplink bandwidth, a 7.72x wall-clock acceleration and reduced FLOP and VRAM footprints for local coreset selection.

Keywords: federated learning, coreset selection, class imbalance, semantic analysis, orthogonal projection embeddings, communication efficiency, data pruning, scientific data
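The abstract describes three per-sample scalar scores: representation, diversity of orthogonal residuals, and boundary proximity. It does not give their formulas, so the following is only one plausible reading, assuming each class is summarized by a prototype vector in latent space. Every name and definition here is a hypothetical illustration and not the SCOPE implementation.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def score_point(z, own_prototype, other_prototypes):
    """Return (representation, diversity, boundary) for a nonzero embedding z.

    Assumed definitions (not from the paper):
      representation: cosine similarity between z and its own class prototype
      diversity:      relative norm of the part of z orthogonal to that prototype
      boundary:       highest cosine similarity to any competing class prototype
    """
    p = _unit(own_prototype)
    z_norm = math.sqrt(_dot(z, z))
    along = _dot(z, p)
    representation = along / z_norm
    residual = [x - along * q for x, q in zip(z, p)]
    diversity = math.sqrt(_dot(residual, residual)) / z_norm
    boundary = max(_dot(z, _unit(o)) / z_norm for o in other_prototypes)
    return representation, diversity, boundary


# Toy 2-D example with made-up vectors.
scores = score_point([3.0, 4.0], own_prototype=[1.0, 0.0], other_prototypes=[[0.0, 1.0]])
```

Because each score is a single scalar per sample, a client could report such values instead of raw embeddings, which is the kind of communication saving the abstract emphasizes.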

171. ❌ Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration

Authors: Riccardo Raciti, Lemuel Puglisi, Francesco Guarnera, Daniele Ravì, Sebastiano Battiato Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

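Scoring rationale: The paper uses deep learning (SynthStrip and SynthSeg) to improve the SIENA neuroimaging analysis pipeline. As an application of AI to biomedicine (neuroimaging), it is partly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not address large models, MoE, scaling laws, training methods, inference techniques, agent systems, or the other keyword topics, so all remaining keywords score 0.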
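
The arithmetic behind these score tables can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword rule `weight * relevance` and all function names are assumptions made for illustration, since the report does not state how a relevance rating is converted into a score.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # cap stated in the report configuration


def passing_score(total_weight: float) -> float:
    """Passing threshold from the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight times relevance (0-10), capped per keyword."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# 27 keywords with weight 1.0 each give the threshold shown in the report.
threshold = passing_score(27 * 1.0)

# Rows as printed for paper 171: every weight is 0.0 and one relevance is 5.0.
rows_171 = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
total_171 = paper_score(rows_171)
passed_171 = total_171 >= threshold
```

Under this assumed rule, a printed weight of 0.0 yields a score of 0.0 whatever the relevance rating, which is consistent with the rows shown in the table.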

!!! tip deepseek-chat TL;DR

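This study integrates deep learning modules (SynthStrip and SynthSeg) into the SIENA pipeline, improving the accuracy and robustness of brain atrophy estimation while preserving interpretability and reducing runtime.

Abstract (Chinese translation)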

磁共振成像（MRI）衍生的脑体积百分比变化（Percentage Brain Volume Change, PBVC）是一种广泛使用的脑萎缩生物标志物，其中SIENA是其评估中最成熟的方法之一。然而，SIENA依赖于经典的图像处理步骤，特别是颅骨剥离（skull stripping）和组织分割（tissue segmentation），这些步骤的失败可能在整个流程中传递并导致萎缩估计偏差。在本研究中，我们探讨了有针对性的深度学习替代方案是否能在保持SIENA成熟且可解释框架的同时提升其性能。为此，我们将SynthStrip和SynthSeg集成到SIENA中，并在ADNI和PPMI纵向队列上评估了三种流程变体。性能通过三个互补标准进行评估：与纵向临床及结构衰退的相关性、扫描顺序一致性以及端到端运行时间。替换颅骨剥离模块带来了最一致的改进：在ADNI数据中，相较于标准SIENA流程，它显著增强了PBVC与多种疾病进展指标之间的关联；同时在两个数据集中，它明显提升了扫描顺序反转下的鲁棒性。完全集成的流程实现了最强的扫描顺序一致性，将误差降低了高达99.1%。此外，支持GPU的变体在保持与标准SIENA相近的CPU运行时间的同时，将执行时间减少了高达46%。总体而言，这些发现表明，当深度学习被用于强化成熟纵向萎缩流程中最薄弱的图像处理步骤时，能够有意义地增强其性能。更广泛而言，本研究强调了在不牺牲可解释性的前提下，对临床可信赖的神经影像工具进行模块化现代化的价值。代码公开于https://github.com/Raciti/Enhanced-SIENA.git。

Abstract

Percentage Brain Volume Change (PBVC) derived from Magnetic Resonance Imaging (MRI) is a widely used biomarker of brain atrophy, with SIENA among the most established methods for its estimation. However, SIENA relies on classical image processing steps, particularly skull stripping and tissue segmentation, whose failures can propagate through the pipeline and bias atrophy estimates. In this work, we examine whether targeted deep learning substitutions can improve SIENA while preserving its established and interpretable framework. To this end, we integrate SynthStrip and SynthSeg into SIENA and evaluate three pipeline variants on the ADNI and PPMI longitudinal cohorts. Performance is assessed using three complementary criteria: correlation with longitudinal clinical and structural decline, scan-order consistency, and end-to-end runtime. Replacing the skull-stripping module yields the most consistent gains: in ADNI, it substantially strengthens associations between PBVC and multiple measures of disease progression relative to the standard SIENA pipeline, while across both datasets it markedly improves robustness under scan reversal. The fully integrated pipeline achieves the strongest scan-order consistency, reducing the error by up to 99.1%. In addition, GPU-enabled variants reduce execution time by up to 46% while maintaining CPU runtimes comparable to standard SIENA. Overall, these findings show that deep learning can meaningfully strengthen established longitudinal atrophy pipelines when used to reinforce their weakest image processing steps. More broadly, this study highlights the value of modularly modernizing clinically trusted neuroimaging tools without sacrificing their interpretability. Code is publicly available at https://github.com/Raciti/Enhanced-SIENA.git.

Keywords: deep learning, neuroimaging, brain atrophy, SIENA, skull stripping, tissue segmentation, MRI, longitudinal analysis
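
One way to read the scan-order consistency criterion in the abstract is that swapping baseline and follow-up scans should invert the estimated volume change. The sketch below illustrates that idea only. The volume-ratio definition of PBVC, the symmetry-error definition, the function names, and the numbers are assumptions for illustration; they are not taken from the paper or from SIENA itself.

```python
def pbvc(volume_baseline: float, volume_followup: float) -> float:
    """Percentage brain volume change from baseline to follow-up."""
    return 100.0 * (volume_followup - volume_baseline) / volume_baseline


def expected_reverse_pbvc(forward_pbvc: float) -> float:
    """PBVC that an ideal, order-consistent pipeline reports for swapped scans."""
    ratio = 1.0 + forward_pbvc / 100.0
    return 100.0 * (1.0 / ratio - 1.0)


def scan_order_error(forward_pbvc: float, reverse_pbvc: float) -> float:
    """Absolute gap between the measured and ideal reverse-order PBVC."""
    return abs(reverse_pbvc - expected_reverse_pbvc(forward_pbvc))


# Hypothetical volumes in millilitres, not data from the study.
forward = pbvc(1500.0, 1485.0)     # baseline -> follow-up
reverse = pbvc(1485.0, 1500.0)     # follow-up -> baseline
ideal_error = scan_order_error(forward, reverse)

# A pipeline whose reverse estimate is off by 0.3 percentage points.
biased_error = scan_order_error(forward, reverse + 0.3)
```

Note that the ideal reverse value is not simply the negated forward value, because each percentage is taken relative to a different starting volume.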

172. ❌ SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization

Authors: Tianwei Ye, Xiaoguang Mei, Yifan Xia, Fan Fan, Jun Huang, Jiayi Ma Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

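Scoring rationale: SGMatch belongs to computer vision and geometry processing. It studies non-rigid 3D shape matching with a learning-based method that integrates semantic features with geometric descriptors and adds conditional flow matching as a regularizer. The scoring keywords all concern large language models, deep learning fundamentals, or scientific applications of AI, whereas this work falls under computer graphics and geometry processing and has no direct connection to them.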

!!! tip deepseek-chat TL;DR

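The paper proposes SGMatch, a learning-based, semantic-guided framework for non-rigid 3D shape matching. By integrating semantic features from vision foundation models with a conditional flow matching regularizer, it achieves accurate point-to-point correspondences under non-isometric deformations and topological noise.

Abstract (Chinese translation)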

在非刚性三维形状之间建立精确的点对点对应关系仍然是一个关键挑战，尤其是在非等距形变和拓扑噪声存在的情况下。现有的功能映射流程存在仅凭几何描述符无法解决的模糊性，以及从截断谱基投影到密集逐点对应关系时固有的空间不一致性问题。本文提出SGMatch，一种基于学习的语义引导非刚性形状匹配框架。具体而言，我们设计了一个语义引导局部交叉注意力模块，该模块将来自视觉基础模型的语义特征整合到几何描述符中，同时保持局部结构连续性。此外，我们引入了一种基于条件流匹配的正则化目标，通过监督一个时变速度场来促进恢复对应关系的空间平滑性。在多个基准数据集上的实验结果表明，SGMatch在近等距设定下取得了具有竞争力的性能，并在非等距形变和拓扑噪声条件下实现了持续改进。

Abstract

Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.

Keywords: non-rigid shape matching, 3D shapes, semantic-guided, flow regularization, geometric descriptors, vision foundation models, point-to-point correspondences, conditional flow matching
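
The conditional flow matching regularizer mentioned in the abstract can be illustrated with the commonly used straight-line path, where a point is interpolated between a source sample x0 and a target x1 and the velocity field is regressed onto x1 - x0. This is a generic sketch in plain Python. The choice of path, the toy velocity functions, and all names are assumptions; the abstract does not specify how SGMatch defines its endpoints or its network.

```python
import random

Vector = list[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(velocity_fn, pairs, rng: random.Random) -> float:
    """Monte Carlo estimate of E ||v(x_t, t) - (x1 - x0)||^2 over the pairs."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()  # t ~ Uniform[0, 1)
        x_t = interpolate(x0, x1, t)
        predicted = velocity_fn(x_t, t)
        target = target_velocity(x0, x1)
        total += sum((p - q) ** 2 for p, q in zip(predicted, target))
    return total / len(pairs)


# Toy data: every target is the source shifted by (1, 2, 0).
rng = random.Random(0)
sources = [[rng.random() for _ in range(3)] for _ in range(8)]
pairs = [(x0, [x0[0] + 1.0, x0[1] + 2.0, x0[2]]) for x0 in sources]


def exact_field(x_t: Vector, t: float) -> Vector:
    """Hypothetical stand-in for a perfectly trained velocity network."""
    return [1.0, 2.0, 0.0]


def zero_field(x_t: Vector, t: float) -> Vector:
    """Hypothetical stand-in for an untrained network that predicts no motion."""
    return [0.0, 0.0, 0.0]


loss_exact = cfm_loss(exact_field, pairs, rng)
loss_zero = cfm_loss(zero_field, pairs, rng)
```

In this toy setting the exact field drives the loss to (numerically) zero, while the zero field pays the full squared displacement for every pair.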

173. ❌ Rethinking VLMs for Image Forgery Detection and Localization

Authors: Shaofeng Guo, Jiequan Cui, Richang Hong Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12930v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

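Scoring rationale: The paper studies vision-language models (VLMs) for image forgery detection and localization (IFDL). It belongs to computer vision and multimodal learning rather than to innovations in large language models (LLMs) or deep learning fundamentals. The keywords all target LLMs and related techniques (MoE, scaling laws, RLHF, RAG, and so on) and have no direct connection to the paper's VLM focus, so every keyword receives a relevance of 0.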

!!! tip deepseek-chat TL;DR

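The paper proposes IFDL-VLM, a new method that uses vision-language models (VLMs) to improve image forgery detection and localization, and reports state-of-the-art results on multiple benchmarks.

Abstract (Chinese translation)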

随着人工智能生成内容（AIGC）的迅速兴起，图像篡改技术日益普及，给图像伪造检测与定位（IFDL）带来了重大挑战。本文研究了如何充分利用视觉-语言模型（VLMs）来辅助IFDL任务。具体而言，我们观察到，由于VLMs固有的偏向语义合理性而非真实性的特性，其先验知识难以提升检测与定位性能，甚至可能产生负面影响。此外，位置掩码明确编码了伪造概念，可作为VLMs的额外先验知识以简化其训练优化过程，从而增强检测与定位结果的可解释性。基于这些发现，我们提出了一种名为IFDL-VLM的新型IFDL流程。为验证方法的有效性，我们在9个常用基准数据集上进行了实验，并评估了模型在域内和跨数据集泛化设置下的性能。实验结果表明，我们的方法在检测、定位和可解释性方面均取得了新的最优性能。代码发布于：https://github.com/sha0fengGuo/IFDL-VLM。

Abstract

With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and interpretability. Code is available at: https://github.com/sha0fengGuo/IFDL-VLM.

Keywords: Vision-Language Models, Image Forgery Detection, Image Forgery Localization, AIGC, IFDL-VLM, Interpretability, Cross-dataset Generalization, State-of-the-art

174. ❌ VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Authors: Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

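Scoring rationale: The paper addresses cross-view pose estimation in computer vision. It proposes VIRD, which builds view-invariant representations through dual-axis transformation, and focuses on geometric transformation, feature alignment, and view reconstruction between ground and satellite images. The content lies entirely within computer vision, robot localization, and geometric deep learning, and involves no large language models, innovations in deep learning fundamentals, or AI for Science applications. Because the scoring keywords all concern those topics and this is purely a computer vision study, every keyword receives a relevance of 0.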

!!! tip deepseek-chat TL;DR

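The paper proposes VIRD, a cross-view pose estimation method that builds view-invariant representations through dual-axis transformation to bridge the viewpoint gap between ground and satellite images. It substantially reduces position and orientation errors on the KITTI and VIGOR datasets.

Abstract (Chinese translation)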

精确的全局定位对于自动驾驶和机器人技术至关重要，但基于全球导航卫星系统（GNSS）的方法常因遮挡和多路径效应而性能下降。作为一种新兴替代方案，跨视角姿态估计通过地面视角图像预测其相对于地理参考卫星图像的3自由度相机位姿。然而，现有方法难以弥合地面与卫星视角间的显著视点差异，主要受限于空间对应关系的不足。本文提出一种新颖的跨视角姿态估计方法，通过双轴变换构建视角不变表示。该方法首先对卫星视图应用极坐标变换以建立水平对应关系，随后在地面特征与极坐标变换后的卫星特征上采用上下文增强的位置注意力机制，以解决垂直错位问题，从而显式缓解视点差异。为进一步增强视角不变性，我们引入了视角重构损失，促使学习到的表示能够重构原始图像及跨视角图像。在KITTI和VIGOR数据集上的实验表明，该方法在无需方向先验的条件下优于现有最优方法，在KITTI数据集上将中位数位置误差与方向误差分别降低了50.7%和76.5%，在VIGOR数据集上分别降低了18.0%和46.8%。

Abstract

Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.

Keywords: cross-view pose estimation, view-invariant representation, dual-axis transformation, polar transformation, positional attention, view-reconstruction loss, autonomous driving, global localization
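
The polar transformation step in the abstract resamples the satellite image around its centre so that each output column corresponds to an azimuth and each row to a radial distance, which lets horizontal position in a ground view line up with the satellite features. A minimal nearest-neighbour sketch in plain Python follows. The output size, the angle convention, and the function name are assumptions, and VIRD's exact transform may differ.

```python
import math

Image = list[list[float]]


def polar_transform(image: Image, out_height: int, out_width: int) -> Image:
    """Resample a square image so rows index radius and columns index azimuth."""
    size = len(image)
    centre = (size - 1) / 2.0
    max_radius = (size - 1) / 2.0
    output = []
    for row in range(out_height):
        # Top row samples farthest from the centre, bottom row at the centre.
        radius = max_radius * (out_height - 1 - row) / max(out_height - 1, 1)
        line = []
        for col in range(out_width):
            angle = 2.0 * math.pi * col / out_width  # 0 points up (north)
            src_x = centre + radius * math.sin(angle)
            src_y = centre - radius * math.cos(angle)
            # Nearest-neighbour lookup, clamped to the image bounds.
            x = min(max(int(round(src_x)), 0), size - 1)
            y = min(max(int(round(src_y)), 0), size - 1)
            line.append(image[y][x])
        output.append(line)
    return output


# Toy 5x5 "satellite image" whose pixel value encodes its (row, column).
satellite = [[10.0 * r + c for c in range(5)] for r in range(5)]
polar = polar_transform(satellite, out_height=3, out_width=4)
```

A practical implementation would use bilinear interpolation on feature maps; nearest-neighbour lookup is used here only to keep the sketch dependency-free.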

175. ❌ DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification

Authors: Joana Reuss, Ekaterina Gikalo, Marco Körner Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12905v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

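Scoring rationale: The paper focuses on few-shot crop-type classification for agricultural monitoring and proposes DirPA to handle class imbalance and prior shift. The keywords concern large models, deep learning fundamentals, or specific AI application areas, but the paper involves no large-model techniques (LLMs, MoE, fine-tuning methods, and so on) and does not discuss innovations in deep learning fundamentals. It is only weakly related to "AI for Science OR Bioinformatics OR Cheminformatics" (agriculture is a scientific application in a broad sense). Because it neither uses the term AI for Science nor discusses that field in depth, it receives 5 points (partly relevant). All other keywords are unrelated and receive 0.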

!!! tip deepseek-chat TL;DR

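This study addresses the prior shift caused by class imbalance in agricultural monitoring with Dirichlet Prior Augmentation (DirPA), and validates across multiple EU countries that the method improves the robustness and per-class performance of few-shot crop-type classification.

Abstract (Chinese translation)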

现实世界的农业监测常受严重的类别不平衡和高昂标签获取成本所限，导致显著的数据稀缺问题。在专门为数据稀缺场景设计的小样本学习框架中，训练集常被人工平衡处理。然而，这种做法与自然界中观察到的长尾分布相脱节，引发分布偏移，从而削弱模型在真实农业任务中的泛化能力。我们此前提出了狄利克雷先验增强方法（Dirichlet Prior Augmentation, DirPA；Reuss等人，2026a），旨在模型训练期间主动缓解此类标签分布偏斜的影响。本研究扩展了原实验的地理范围，具体通过在欧盟多个国家进行评估，突破局部实验的局限，以测试该方法在不同农业环境中的适应性。结果表明，DirPA在不同地理区域均表现出有效性。我们证明，无论目标区域如何，DirPA不仅能提升系统鲁棒性、在极端长尾分布下稳定训练过程，还能通过主动模拟先验知识，显著提高各个特定类别的性能表现。

Abstract

Real-world agricultural monitoring is often hampered by severe class imbalance and high label acquisition costs, resulting in significant data scarcity. In few-shot learning (FSL), a framework specifically designed for data-scarce settings, training sets are often artificially balanced. However, this creates a disconnect from the long-tailed distributions observed in nature, leading to a distribution shift that undermines the model's ability to generalize to real-world agricultural tasks. We previously introduced Dirichlet Prior Augmentation (DirPA; Reuss et al., 2026a) to proactively mitigate the effects of such label distribution skews during model training. In this work, we extend the original study's geographical scope. Specifically, we evaluate this extended approach across multiple countries in the European Union (EU), moving beyond localized experiments to test the method's resilience across diverse agricultural environments. Our results demonstrate the effectiveness of DirPA across different geographical regions. We show that DirPA not only improves system robustness and stabilizes training under extreme long-tailed distributions, regardless of the target region, but also substantially improves individual class-specific performance by proactively simulating priors.

Keywords: few-shot learning, class imbalance, prior shift, agricultural monitoring, crop-type classification, Dirichlet Prior Augmentation, long-tailed distributions, geographical generalization
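
The idea of simulating class priors during training, as described in the abstract, can be sketched by drawing a prior from a Dirichlet distribution and shifting the classifier logits by its logarithm, as in standard logit adjustment. This is an illustration under assumptions: the concentration value, the logit-shift mechanism, and the function names are hypothetical, and the abstract does not state how DirPA applies the sampled priors.

```python
import math
import random


def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def adjust_logits(logits: list[float], prior: list[float]) -> list[float]:
    """Assumed mechanism: add the log prior to each class logit."""
    eps = 1e-12  # guards against log(0) for vanishing prior mass
    return [z + math.log(p + eps) for z, p in zip(logits, prior)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
num_classes = 4
# A small concentration (< 1) tends to produce skewed, long-tailed priors.
prior = sample_dirichlet([0.5] * num_classes, rng)

# With uniform logits, the adjusted prediction reproduces the sampled prior
# (up to the small eps guard).
uniform_logits = [0.0] * num_classes
probs = softmax(adjust_logits(uniform_logits, prior))
```

Resampling the prior for each training episode would expose the model to many different label distributions instead of a single artificially balanced one.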

176. ❌ Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

Authors: Yinuo Jiang, Jun Cheng, Yiran Wang, Cheng Cheng Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12903v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

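Scoring rationale: The paper focuses on Neural Radiance Field (NeRF) reconstruction and pose optimization for LiDAR point clouds, which belongs to computer vision and 3D reconstruction. The scoring keywords all concern large language models (LLMs), deep learning fundamentals, or applications of AI in science, whereas this paper tackles geometric reconstruction from a specific sensor and involves no large-model techniques, training methods, inference optimization, or scientific AI applications. Every keyword therefore receives a relevance of 0.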

!!! tip deepseek-chat TL;DR

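The paper proposes SG-NLF, a pose-free LiDAR NeRF framework that combines spectral priors with geometric consistency to address the geometric holes caused by the sparsity and textureless nature of LiDAR data. Experiments show that it improves reconstruction quality and pose accuracy by over 35.8% and 68.8%, respectively.

Abstract (Chinese translation)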

神经辐射场（NeRF）在图像新视角合成（NVS）领域取得了显著成功，并启发了其在激光雷达新视角合成中的扩展。然而，现有方法大多严重依赖精确的相机位姿进行场景重建。激光雷达数据的稀疏性和缺乏纹理特性也带来了独特挑战，导致几何空洞和表面不连续。为解决这些问题，我们提出了SG-NLF——一种融合光谱信息与几何一致性的无位姿激光雷达NeRF框架。具体而言，我们设计了一种基于光谱先验的混合表示方法以重建平滑几何结构。针对位姿优化问题，我们构建了基于特征兼容性的置信感知图来实现全局对齐。此外，引入对抗学习策略以增强跨帧一致性，从而提升重建质量。综合实验验证了我们框架的有效性，尤其在具有挑战性的低频场景中表现突出。相较于先前最优方法，SG-NLF在重建质量和位姿精度上分别提升了35.8%和68.8%。本工作可为激光雷达视角合成提供新的研究视角。

Abstract

Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.

Keywords: LiDAR view synthesis, Neural Radiance Fields, pose-free reconstruction, spectral-geometric representation, confidence-aware graph, adversarial learning, 3D reconstruction

177. ❌ Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

Authors: Mingkai Zhai, Wei Wang, Zongsheng Li, Quanying Liu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12887v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

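Scoring rationale: The paper studies video-based epileptic seizure forecasting with a cross-species transfer learning framework, an application of AI to biomedicine. It is unrelated to most keywords (LLM, MoE, RLHF, and so on), which mainly concern large-model techniques. Only two keywords are relevant: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation": the paper uses cross-species data for auxiliary pretraining, which falls under transfer learning and domain adaptation, so it receives 5 points; (2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (epilepsy research), so it receives 8 points. No other keyword is addressed, so the rest receive 0.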

!!! tip deepseek-chat TL;DR

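The paper formulates a new task of video-based epileptic seizure forecasting. Its cross-species transfer learning framework uses large-scale rodent video data for auxiliary pretraining and achieves over 70% prediction accuracy, showing potential for non-invasive, scalable early-warning systems for epilepsy.

Abstract (Chinese translation)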

癫痫发作预测是癫痫研究中具有重要临床意义但极具挑战性的课题。现有方法主要依赖脑电图等神经信号，这些方法需要专用设备，限制了其在真实场景中的长期部署。相比之下，视频数据提供了一种非侵入性且易于获取的替代方案，然而现有的基于视频的研究主要集中于发作后检测，对发作预测的探索尚不充分。本研究提出了一种基于视频的癫痫发作预测新任务，即利用短暂的发作前视频片段（3-10秒）来预测后续5秒内是否会发生癫痫发作。针对带标注的人类癫痫视频数据稀缺的问题，我们提出了一种跨物种迁移学习框架，该框架利用大规模啮齿动物视频数据进行辅助预训练。这使得模型能够捕获具有跨物种泛化能力的癫痫相关行为动力学特征。实验结果表明，在严格仅使用视频数据的设定下，我们的方法实现了超过70%的预测准确率，并优于现有基线方法。这些发现凸显了跨物种学习在构建非侵入性、可扩展的癫痫早期预警系统方面的潜力。

Abstract

Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.

Keywords: epileptic seizure forecasting, video-based prediction, cross-species transfer learning, pre-ictal video segments, rodent video data, non-invasive early-warning system, behavioral dynamics, prediction accuracy
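
The task formulation in the abstract (a short pre-ictal clip predicts whether a seizure starts within the next 5 seconds) can be turned into labelled training windows as sketched below. The sliding-window scheme, the stride, the default clip length, and the function name are assumptions for illustration; the abstract specifies only the 3-10 second input range and the 5 second horizon.

```python
def make_forecast_windows(
    video_duration: float,
    onset_times: list[float],
    clip_length: float = 5.0,   # assumed value inside the 3-10 s range
    horizon: float = 5.0,       # forecast horizon from the abstract
    stride: float = 1.0,        # assumed step between consecutive clips
) -> list[tuple[float, float, int]]:
    """Return (clip_start, clip_end, label) triples for one recording.

    label is 1 when a seizure onset falls in (clip_end, clip_end + horizon].
    Clips that contain an onset are skipped; seizure end times are not
    modelled in this sketch.
    """
    windows = []
    start = 0.0
    while start + clip_length <= video_duration:
        end = start + clip_length
        contains_onset = any(start <= t <= end for t in onset_times)
        if not contains_onset:
            label = int(any(end < t <= end + horizon for t in onset_times))
            windows.append((start, end, label))
        start += stride
    return windows


# Hypothetical 30 s recording with one seizure onset at 20 s.
windows = make_forecast_windows(video_duration=30.0, onset_times=[20.0])
positives = [w for w in windows if w[2] == 1]
```

With these settings only the clips ending between 15 s and just before 20 s are labelled positive, which shows how few positive windows a single seizure produces.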

178. ❌ A protocol for evaluating robustness to H&E staining variation in computational pathology models

Authors: Lydia A. Schönpflug, Nikki van den Berg, Sonali Andani, Nanda Horeweg, Jurriaan Barkey Wolf, Tjalling Bosse, Viktor H. Koelzer, Maxime W. Lafarge Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

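Scoring rationale: The paper focuses on an evaluation protocol for computational pathology (CPath) models, specifically robustness to H&E staining variation. It applies deep learning models to biomedical image analysis but is unrelated to most keywords (LLM, MoE, SFT, RLHF, RAG, and so on), which mainly target large language models and related techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to bioinformatics and computational pathology. Since this is not a core technical innovation, it receives 5 points (partly relevant). No other keyword is addressed, so the rest receive 0.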

!!! tip deepseek-chat TL;DR

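The study proposes a three-step protocol for evaluating the robustness of computational pathology models to H&E staining variation and applies it to 306 microsatellite instability classification models. It finds a weak inverse correlation between robustness and classification performance, and the protocol supports reliability assessment for model deployment.

Abstract (Chinese translation)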

对染色差异的敏感性仍是计算病理学模型部署的主要障碍，因为苏木精-伊红染色在不同实验室间存在差异，需要系统评估这种变异性如何影响模型预测。本研究开发了一套三步流程，用于评估计算病理学模型对H&E染色差异的鲁棒性：第一步选择参考染色条件，第二步表征测试集染色特性，第三步在模拟参考染色条件下应用计算病理学模型。我们首先基于PLISM数据集构建了新的参考染色库。作为示例应用，我们将该流程用于评估306个微卫星不稳定性分类模型在未见过的SurGen结直肠癌数据集（n=738）上的鲁棒性，其中包括基于TCGA-COAD/READ数据集训练的300个注意力机制多示例学习模型（涵盖UNI2-h、H-Optimus-1、Virchow2三种特征提取器）以及六个公开的MSI分类模型。以AUC衡量分类性能，并以四种模拟染色条件（低/高H&E染色强度、低/高H&E颜色相似度）下的AUC极差范围评估鲁棒性。所有模型在各类染色条件下的分类性能范围为AUC 0.769-0.911（$Δ$=0.142），鲁棒性范围为0.007-0.079（$Δ$=0.072），且与分类性能呈弱负相关（Pearson r=-0.22，95% CI [-0.34, -0.11]）。研究表明，该评估流程可实现基于鲁棒性信息的计算病理学模型选择，揭示H&E染色条件变化下的性能波动，为确定可靠模型部署的操作范围提供依据。代码详见https://github.com/CTPLab/staining-robustness-evaluation。

Abstract

Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models as hematoxylin and eosin (H&E) staining varies across laboratories, requiring systematic assessment of how this variability affects model prediction. In this work, we developed a three-step protocol for evaluating robustness to H&E staining variation in CPath models. Step 1: Select reference staining conditions, Step 2: Characterize test set staining properties, Step 3: Apply CPath model(s) under simulated reference staining conditions. Here, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness properties of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), including 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high H&E intensity, low/high H&E color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769-0.911 ($Δ$ = 0.142). Robustness ranged from 0.007-0.079 ($Δ$ = 0.072), and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across H&E staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at https://github.com/CTPLab/staining-robustness-evaluation .
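The robustness measure described in the abstract (the min-max AUC range across the four simulated staining conditions) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `auc_range` and the AUC values are made-up placeholders, not results from the paper.

```python
def auc_range(auc_by_condition: dict[str, float]) -> float:
    """Return max(AUC) - min(AUC) over staining conditions (smaller = more robust)."""
    values = list(auc_by_condition.values())
    if not values:
        raise ValueError("need at least one staining condition")
    return max(values) - min(values)


# Hypothetical AUCs for one model under the four simulated conditions.
example_aucs = {
    "low_intensity": 0.86,
    "high_intensity": 0.88,
    "low_color_similarity": 0.85,
    "high_color_similarity": 0.89,
}
robustness = auc_range(example_aucs)  # 0.89 - 0.85
```

A smaller range means the model's AUC moves less between staining conditions, which is the sense in which the abstract's 0.007-0.079 interval describes robustness.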

Keywords: computational pathology, H&E staining variation, robustness evaluation, microsatellite instability classification, attention-based multiple instance learning, staining simulation, model deployment, AUC performance

179. ❌ TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking

Authors: Jiale Meng, Jie Zhang, Runyi Hu, Zhe-Ming Lu, Tianwei Zhang, Yiming Li Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

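Scoring rationale: TRACE addresses document watermarking, using diffusion models for character encoding; it belongs to computer vision and document security. The scoring keywords all concern large models, the principles of deep learning techniques, and their scientific applications, whereas this paper's core is applying diffusion models to character structure. It involves no language models, model training, inference optimization, alignment, agent systems, or other large-model techniques, and it is not applied to bioinformatics or other scientific domains, so every keyword has relevance 0.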

!!! tip deepseek-chat TL;DR

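The paper proposes TRACE, a structure-aware character encoding framework that uses diffusion models to embed robust, generalizable watermarks in documents. Experiments show that it outperforms existing methods in PSNR and extraction accuracy and works across multiple languages and fonts.

Abstract (Chinese translation)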

我们提出TRACE，一种利用扩散模型进行局部字符编码以嵌入数据的结构感知框架。与现有依赖边缘特征或预定义码本的方法不同，TRACE利用字符结构，因其在不同字符间具有稳定且统一的表示特性，从而对噪声干扰具备天然抵抗力。该框架包含三个核心组件：(1) 自适应扩散初始化，通过运动概率估计器（MPE）、目标点估计（TPE）和掩码绘制模型（MDM）等专用算法自动识别控制点、目标点及编辑区域；(2) 引导扩散编码，实现选定点的精确移动；(3) 掩码区域替换，采用专用损失函数以最小化扩散过程后的特征改变。综合实验表明，TRACE在性能上显著优于现有先进方法，在跨媒体传输后实现了超过5 dB的峰值信噪比（PSNR）提升及5%的提取准确率提高。该框架在多种语言和字体中展现出广泛通用性，使其特别适用于实际文档安全应用场景。

Abstract

We propose TRACE, a structure-aware framework leveraging diffusion models for localized character encoding to embed data. Unlike existing methods that rely on edge features or pre-defined codebooks, TRACE exploits character structures that provide inherent resistance to noise interference due to their stability and unified representation across diverse characters. Our framework comprises three key components: (1) adaptive diffusion initialization that automatically identifies handle points, target points, and editing regions through specialized algorithms including a movement probability estimator (MPE), target point estimation (TPE), and a mask drawing model (MDM), (2) guided diffusion encoding for precise movement of selected points, and (3) masked region replacement with a specialized loss function to minimize feature alterations after the diffusion process. Comprehensive experiments demonstrate TRACE's superior performance over state-of-the-art methods, achieving more than 5 dB improvement in PSNR and 5% higher extraction accuracy following cross-media transmission. TRACE achieves broad generalizability across multiple languages and fonts, making it particularly suitable for practical document security applications.
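The abstract reports gains in PSNR (peak signal-to-noise ratio). For readers unfamiliar with the metric, the standard definition is PSNR = 10·log10(MAX² / MSE). The sketch below implements it for flat lists of 8-bit pixel values; the function name and sample pixels are illustrative and unrelated to TRACE's own evaluation code.

```python
import math


def psnr(reference: list[float], distorted: list[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative pixels only (not data from the paper).
clean = [10.0, 20.0, 30.0, 40.0]
noisy = [12.0, 18.0, 33.0, 39.0]
score_db = psnr(clean, noisy)

# At a fixed max_value, a 5 dB gain corresponds to dividing the MSE by 10 ** 0.5.
mse_ratio_for_5db = 10 ** (5 / 10)
```

Because PSNR is logarithmic in the mean squared error, the "more than 5 dB" improvement quoted above corresponds to an MSE more than roughly three times smaller.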

Keywords: Document Watermarking, Character Encoding, Diffusion Models, Structure-Aware, Robustness, Generalizability, Cross-media Transmission, Document Security

180. ❌ Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

Authors: Yifan Zhan, Zhengqing Chen, Qingjie Wang, Zhuo He, Muyao Niu, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Yinqiang Zheng Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

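Scoring rationale: The paper studies scenario generation for autonomous driving, focusing on disentangled control for video generation models (CompoSIA), including identity injection and action control mechanisms, to synthesize adversarial driving scenarios for testing planners. The keywords all relate directly to large language models, the principles of deep learning techniques, or AI-for-science applications, whereas this paper concentrates on generative models for computer vision and autonomous driving. It involves no LLM techniques, training methods, inference optimization, alignment, agent systems, or scientific AI applications, so every keyword has relevance 0.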

!!! tip deepseek-chat TL;DR

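The paper tackles the challenge of generating safety-critical edge cases for autonomous driving. It proposes CompoSIA, a compositional driving video simulator that disentangles control over scene structure, object identity, and ego actions, achieving high-quality adversarial scenario synthesis and substantially raising planner collision rates in downstream stress tests.

Abstract (Chinese translation)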

自动驾驶领域的一个主要挑战是安全关键边缘案例的"长尾"问题，这些案例通常产生于常见交通要素的异常组合。合成这些场景至关重要，然而当前可控生成模型提供的指导不完整或存在纠缠，无法实现对场景结构、物体身份和自车动作的独立操控。我们提出CompoSIA——一个组合式驾驶视频模拟器，能够解耦这些交通要素，实现对多样化对抗性驾驶场景的细粒度控制。为支持场景要素的可控身份替换，我们提出噪声级身份注入技术，仅需单张参考图像即可实现跨不同姿态要素的姿态无关身份生成。此外，我们引入分层双分支动作控制机制以提升动作可控性。这种解耦控制能力使得对抗性场景合成成为可能——系统性地将安全要素组合成纠缠生成器无法产生的危险配置。大量对比实验表明，本方法在可控生成质量上优于最先进的基线模型：身份编辑的FVD指标提升17%，动作控制的旋转误差和平移误差分别降低30%和47%。下游压力测试进一步揭示了显著的规划器失效现象：在所有编辑模式下，3秒平均碰撞率增加了173%。

Abstract

A major challenge in autonomous driving is the “long tail” of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis, systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average collision rate of 3s increases by 173%.

Keywords: autonomous driving, adversarial scenario generation, controllable video generation, disentangled control, identity injection, action control, driving simulator, stress-testing

181. ❌ Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

Authors: Fei Wang, Xinye Zheng, Kun Li, Yanyan Wei, Yuxin Liu, Ganpeng Hu, Tong Bao, Jingwen Yang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12845v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

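Scoring rationale: The paper's core is predicting enzyme kinetic parameters with protein language models (PLMs), an applied innovation of large models in bioinformatics. Highly relevant keywords: (1) "Large Language Models" (PLMs apply large models to proteins); (2) "Mixture of Experts" (the paper proposes a Geometry-aware Mixture-of-Experts method); (3) "Post-training" (it fine-tunes PLMs); (4) "AI for Science" (a bioinformatics application). Moderately relevant keywords: "Pre-training" (PLMs are pretrained) and "PEFT" (the fine-tuning can be viewed as parameter-efficient adaptation). Other keywords, such as SLMs, Scaling Laws, and RLHF, are unrelated to the paper.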
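To make the score table above easier to read, here is a minimal sketch of how the totals could be computed. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword rule `weight × relevance` is an assumption inferred from the table columns, and the function names are hypothetical.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Passing-score formula stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores; ASSUMES score = weight * relevance, capped per keyword."""
    return sum(min(weight * relevance, MAX_SCORE_PER_KEYWORD) for weight, relevance in rows)


# (weight, relevance) pairs for the non-zero-relevance rows of the table above.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0)]
threshold = passing_score(27.0)  # 27 keywords, each listed with weight 1.0 in the configuration
paper_total = total_score(rows)
```

Under this assumption, a weight of 0.0 in every row yields a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this paper. Note that the configuration lists each keyword weight as 1.0, while the table rows show 0.0.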

!!! tip deepseek-chat TL;DR

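The paper proposes ERBA, a multimodal protein language model framework that predicts enzyme kinetic parameters through two-stage conditional modeling (molecular recognition, then a geometry-aware mixture of experts). Experiments show that it outperforms existing baselines on multiple benchmarks and has better out-of-distribution performance.

Abstract (Chinese translation)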

预测酶动力学参数可量化酶在特定生化条件下催化特定底物的效率。经典参数如转换数（$k_\text{cat}$）、米氏常数（$K_\text{m}$）和抑制常数（$K_\text{i}$）共同取决于酶序列、底物化学性质以及结合过程中活性位点的构象适应。许多学习流程将这一过程简化为酶与底物之间的静态兼容性问题，通过浅层操作融合其表征并回归单一数值。此类方法忽略了催化的阶段性本质，该过程涉及底物识别与构象适应两个阶段。为此，我们将动力学预测重新构建为阶段性多模态条件建模问题，并引入酶-反应桥接适配器（Enzyme-Reaction Bridging Adapter, ERBA）。该适配器通过微调将跨模态信息注入蛋白质语言模型（Protein Language Models, PLMs），同时保留其生化先验知识。ERBA分两阶段执行条件化处理：首先，分子识别交叉注意力（Molecular Recognition Cross-Attention, MRCA）将底物信息注入酶表征以捕获特异性；随后，几何感知专家混合层（Geometry-aware Mixture-of-Experts, G-MoE）整合活性位点结构，并将样本路由至针对活性口袋的专门化专家模块以反映诱导契合效应。为保持语义保真度，酶-底物分布对齐（Enzyme-Substrate Distribution Alignment, ESDA）在再生核希尔伯特空间中强制PLM流形内的分布一致性。在三个动力学终点指标及多种PLM骨干网络上的实验表明，相较于仅使用序列的基线方法与浅层融合基线，ERBA实现了持续的性能提升并展现出更强的分布外泛化能力，为可扩展的动力学预测提供了基于生物学原理的路径，并为整合辅因子、突变及时间分辨结构信息奠定了基础。

Abstract

Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.

Keywords: Protein Language Models, Enzyme Kinetic Parameters, Multimodal Modeling, Mixture of Experts, Fine-tuning, Bioinformatics, Conformational Adaptation, Substrate Recognition

182. ❌ coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Authors: Chunhan Li, Qifeng Wu, Jia-Hui Pan, Ka-Hei Hui, Jingyu Hu, Yuming Jiang, Bin Sheng, Xihui Liu, Wenjuan Gong, Zhengzhe Liu Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

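Scoring rationale: The paper proposes coDrawAgents, a multi-agent dialogue framework for compositional image generation in which four specialized agents (Interpreter, Planner, Checker, Painter) collaborate. The work is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" and "Multi-agent Systems OR Agent Coordination" because its core is the design of a multi-agent system that carries out a complex task. However, the paper focuses on the specific application of text-to-image generation and does not address the principles of large-model techniques (LLM architecture, training methods, inference optimization, and so on), scientific AI applications, or the other keyword areas. Only these two keywords therefore score highly; the rest are unrelated.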

!!! tip deepseek-chat TL;DR

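The paper addresses the difficulty of composing multiple objects and preserving their attributes in text-to-image generation. It proposes coDrawAgents, a multi-agent dialogue framework in which four specialized agents collaborate; experiments show substantial improvements in text-image alignment, spatial accuracy, and attribute binding.

Abstract (Chinese translation)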

文本到图像生成技术发展迅速，但现有模型在复杂场景中忠实组合多个对象并保持其属性方面仍面临困难。我们提出coDrawAgents，这是一个交互式多智能体对话框架，包含四个专用智能体：解析器（Interpreter）、规划器（Planner）、检查器（Checker）和绘制器（Painter），它们通过协作改进组合式生成。解析器自适应地选择直接文本到图像路径或基于布局的多智能体处理流程。在布局感知模式下，它将提示解析为富含属性的对象描述符，依据语义显著性进行排序，并将具有相同语义优先级级别的对象分组以进行联合生成。在解析器的引导下，规划器采用分治策略，逐步为具有相同语义优先级级别的对象提出布局方案，同时将决策锚定在画布不断演化的视觉上下文中。检查器通过验证空间一致性与属性对齐，并在渲染前优化布局，引入了显式的纠错机制。最后，绘制器逐步合成图像，将新规划的对象融入画布，为后续迭代提供更丰富的上下文。这些智能体共同解决了三个关键挑战：降低布局复杂性、将规划锚定于视觉上下文，以及实现显式错误校正。在GenEval和DPG-Bench基准测试上的大量实验表明，与现有方法相比，coDrawAgents在文本-图像对齐、空间准确性和属性绑定方面均有显著提升。

Abstract

Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG-Bench demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.

Keywords: multi-agent systems, text-to-image generation, compositional generation, dialogue framework, layout planning, error correction, agent collaboration, visual context grounding

183. ❌ GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

Authors: Jiao Wang, Chi Liu, Yiying Zhang, Hongchen Luo, Zhifen Guo, Ying Hu, Ke Xu, Jing Zhou, Hongyan Xu, Ruiting Zhou, Man Tang Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12800v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

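Scoring rationale: The paper focuses on medical image analysis, specifically glaucoma classification, and proposes a multimodal dataset (GLEAM) and a hierarchical attentive masked modeling (HAMM) framework. Its content is unrelated to most keywords (those covering large-model principles, training methods, inference optimization, agent systems, and so on), which target natural language processing or general AI. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies AI to biomedicine (ophthalmology), which falls under AI for Science, but its core innovation is not in large-model techniques, so it receives 5 points (some relevance).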

!!! tip deepseek-chat TL;DR

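The study presents GLEAM, the first publicly available tri-modal glaucoma dataset, and HAMM, a hierarchical attentive masked modeling framework that integrates multimodal medical imaging information for accurate glaucoma classification across disease stages.

Abstract (Chinese translation)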

我们提出青光眼病灶多模态影像评估与分析系统（GLEAM），这是首个公开的三模态青光眼数据集，包含扫描激光检眼镜眼底图像、视盘周围OCT图像和视野模式偏差图，并标注了四种疾病分期，能够有效利用多模态互补信息，促进不同疾病阶段的精准诊断与治疗。为有效整合跨模态信息，我们提出用于多模态青光眼分类的分层注意力掩码建模方法。该框架采用分层注意力编码器与轻量解码器，将跨模态表征学习的重点集中于编码器。

Abstract

We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.

Keywords: glaucoma classification, multimodal imaging, medical dataset, hierarchical attention, masked modeling, ophthalmology, disease staging, cross-modal integration

184. ❌ OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

Authors: Shijie Zhao, Xuanyu Zhang, Bin Chen, Weiqi Li, Qunliang Xing, Kexin Zhang, Yan Wang, Junlin Li, Li Zhang, Jian Zhang, Tianfan Xue Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

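Scoring rationale: The paper focuses on image super-resolution and mainly concerns aligning generative models with human visual preference (Alignment) and using LoRA for parameter-efficient fine-tuning (PEFT). The proposed OARS framework comprises COMPASS, an MLLM-based reward model, and an online alignment process, and the abstract explicitly mentions "alignment" and "LoRA optimization". Other keywords, such as LLMs, MoE, and RLHF, are unrelated because the paper's core is image processing rather than language models or related techniques.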

!!! tip deepseek-chat TL;DR

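The paper proposes the OARS framework, which uses COMPASS, an MLLM-based reward model, and an online alignment process to address the perception-fidelity trade-off in generative real-world image super-resolution models. It improves perceptual quality while maintaining fidelity and reaches state-of-the-art performance on Real-ISR benchmarks.

Abstract (Chinese translation)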

将生成式真实世界图像超分辨率模型与人类视觉偏好对齐具有挑战性，这源于感知-保真度的权衡以及多样且未知的退化类型。现有方法依赖于离线偏好优化和静态指标聚合，这些方法通常缺乏可解释性，且在强条件约束下容易产生伪多样性。我们提出了OARS，这是一个基于过程感知的在线对齐框架，其核心是COMPASS——一种基于多模态大语言模型的奖励模型。COMPASS通过联合建模保真度保持与感知增益，并采用输入质量自适应的权衡策略，来评估从低分辨率到超分辨率的转换过程。为训练COMPASS，我们构建了涵盖合成与真实退化的COMPASS-20K数据集，并引入了一个三阶段感知标注流程，以产生经过校准的细粒度训练标签。在COMPASS的指导下，OARS通过浅层LoRA优化进行策略内探索，执行从冷启动流匹配开始，逐步过渡到全参考，最终实现无参考强化学习的渐进式在线对齐。大量实验和用户研究表明，该方法在保持保真度的同时实现了感知质量的持续提升，在真实世界图像超分辨率基准测试中达到了最先进的性能水平。

Abstract

Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception–fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.
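The abstract mentions LoRA optimization. As background, LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / r)·B·A. The sketch below shows that forward pass in plain Python; the shapes, values, and function names are illustrative assumptions and do not reflect how OARS applies LoRA.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha: float) -> list[float]:
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank (rows of A)."""
    rank = len(a)
    base = matvec(w, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))  # trainable low-rank path
    return [y + (alpha / rank) * u for y, u in zip(base, update)]


w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (2, 3)
a = [[0.5, 0.5, 0.5]]                   # A, shape (r=1, 3)
b_zero = [[0.0], [0.0]]                 # B starts at zero, shape (2, r=1)
b_trained = [[1.0], [-1.0]]             # hypothetical B after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(w, a, b_zero, x, alpha=2.0)
y_tuned = lora_forward(w, a, b_trained, x, alpha=2.0)
```

With B initialized to zero, the adapted layer reproduces the frozen model exactly, so training starts from the pretrained behavior and only the small A and B matrices change.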

Keywords: Generative Image Super-Resolution, Alignment, LoRA, Online Alignment, Perceptual Quality, Fidelity Preservation, MLLM-based Reward, Real-ISR

185. ❌ Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

Authors: Yang Chen, Yi Yu, Jiaming He, Yueqi Duan, Zheng Zhu, Yap-Peng Tan Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12796v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

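Scoring rationale: The paper focuses on defending against resource-targeting attacks in 3D Gaussian Splatting (3DGS) and proposes a spectral-analysis method to suppress Gaussian overgrowth. Its core content concerns computer graphics, 3D reconstruction, and adversarial attack defense, and is unrelated to all scoring keywords, which center on large models, the principles of deep learning techniques, and their applications. It involves no large language models, deep learning model training, alignment, reasoning, agent systems, or scientific AI applications.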

!!! tip deepseek-chat TL;DR

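The paper targets resource-targeting attacks in 3D Gaussian Splatting and proposes a spectral defense that suppresses Gaussian overgrowth through 3D frequency filtering and 2D spectral regularization. Under attack, it suppresses overgrowth by up to 5.92×, reduces memory by up to 3.66×, and improves speed by up to 4.34×.

Abstract (Chinese translation)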

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）技术的最新进展实现了高质量的渲染效果，但高斯表示也暴露出一种新的攻击面——资源导向型攻击。该攻击通过毒化训练图像，过度诱导高斯增长以导致资源耗尽。尽管已有研究探索了面向效率的平滑化、阈值化和剪枝等方法，但这些空间域策略仅作用于可见结构，忽略了隐蔽扰动如何扭曲训练数据底层的频谱行为。因此，被毒化的输入会引入异常的高频放大，误导3DGS将噪声模式误判为细节结构，最终导致高斯不稳定过度增长和场景保真度下降。为解决这一问题，我们提出在高斯场与图像场中实施频谱防御。我们首先设计了一种三维频率滤波器，以选择性剪枝表现出异常高频的高斯单元。由于自然场景中也包含合理的高频结构，直接抑制高频并不充分，我们进一步在渲染结果上开发了二维频谱正则化方法，在区分自然各向同性频率的同时，惩罚各向异性的角向能量，从而约束噪声模式。实验表明，我们的防御方法能够构建鲁棒、精确且安全的三维高斯泼溅系统，在遭受攻击时将过度增长抑制至多$5.92$倍，内存占用降低至多$3.66$倍，渲染速度提升至多$4.34$倍。

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) deliver high-quality rendering, yet the Gaussian representation exposes a new attack surface, the resource-targeting attack. This attack poisons training images, excessively inducing Gaussian growth to cause resource exhaustion. Although efficiency-oriented methods such as smoothing, thresholding, and pruning have been explored, these spatial-domain strategies operate on visible structures but overlook how stealthy perturbations distort the underlying spectral behaviors of training data. As a result, poisoned inputs introduce abnormal high-frequency amplifications that mislead 3DGS into interpreting noisy patterns as detailed structures, ultimately causing unstable Gaussian overgrowth and degraded scene fidelity. To address this, we propose Spectral Defense in Gaussian and image fields. We first design a 3D frequency filter to selectively prune Gaussians exhibiting abnormally high frequencies. Since natural scenes also contain legitimate high-frequency structures, directly suppressing high frequencies is insufficient, and we further develop a 2D spectral regularization on renderings, distinguishing naturally isotropic frequencies while penalizing anisotropic angular energy to constrain noisy patterns. Experiments show that our defense builds robust, accurate, and secure 3DGS, suppressing overgrowth by up to $5.92\times$, reducing memory by up to $3.66\times$, and improving speed by up to $4.34\times$ under attacks.

Keywords: 3D Gaussian Splatting, resource-targeting attack, spectral defense, frequency filter, Gaussian overgrowth, anisotropic angular energy, memory reduction, inference speed improvement
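
The idea of penalizing anisotropic angular energy in the spectrum of a rendering can be illustrated with a small NumPy sketch. This is not the paper's implementation: the binning scheme, the circular frequency mask, and the names `angular_energy` and `anisotropy_penalty` are assumptions made purely for illustration.

```python
import numpy as np

def angular_energy(image, n_bins=16):
    """Share of spectral power per orientation bin (angles folded to [0, pi))."""
    h, w = image.shape
    # Power spectrum with the DC component removed and zero frequency centered.
    power = np.abs(np.fft.fftshift(np.fft.fft2(image - image.mean()))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    # Orientation of every frequency bin; the spectrum of a real image is symmetric.
    theta = np.mod(np.arctan2(fy, fx), np.pi)
    bins = np.minimum((theta / np.pi * n_bins).astype(int), n_bins - 1)
    # ASSUMPTION: keep only frequencies inside the inscribed circle so that
    # every orientation bin covers the same area of the frequency plane.
    mask = np.hypot(fy, fx) <= 0.5
    energy = np.bincount(bins[mask], weights=power[mask], minlength=n_bins)
    return energy / max(energy.sum(), 1e-12)

def anisotropy_penalty(image, n_bins=16):
    """Squared deviation of the angular energy from a uniform (isotropic) profile."""
    e = angular_energy(image, n_bins)
    return float(n_bins * np.sum((e - 1.0 / n_bins) ** 2))

# Isotropic noise spreads its energy over all orientations; stripes do not.
rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
stripes = np.tile(np.sin(2 * np.pi * 8 * np.arange(64) / 64), (64, 1))
print(anisotropy_penalty(noise), anisotropy_penalty(stripes))
```

In this toy setting the penalty stays near zero for the noise image and grows when the energy collapses onto a single orientation, which is the kind of signal a regularizer could use to separate isotropic content from strongly oriented patterns.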

186. ❌ What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Authors: Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

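Scoring rationale: The paper studies the adversarial robustness of vision-language models (VLMs), an application of large models to a specific domain (vision-language). Its core contribution is the R-Adapt framework, which balances robustness and accuracy by freezing the pre-trained weights and applying minimal adaptations only in the shallow layers. Relevance to the keywords: 1) some connection to "Large Language Models" (3 points), since VLMs such as LLaVA and Qwen-VL are involved, although this is not core research on text-only LLMs; 2) highly relevant to "Post-training/SFT" (8 points) and "PEFT/LoRA" (8 points), since it studies adversarially fine-tuned models and parameter-efficient adaptation (frozen weights, minimal adjustment); 3) relevant to "Pre-training/Domain Adaptation" (5 points), since it adapts pre-trained models; 4) relevant to "Mechanistic Interpretability" (5 points), since it analyzes robustness mechanisms (e.g., low-frequency spectral bias and attention patterns). The remaining keywords (MoE, quantization, RAG, etc.) are unrelated to the paper (0 points).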
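
How the total in the table relates to the pass mark can be sketched in a few lines of Python. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the rule that each keyword score equals weight × relevance is an assumption, chosen because it is consistent with the rows above.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration

def passing_score(total_weight):
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def total_score(rows):
    """Sum the per-keyword scores.

    ASSUMPTION: score = weight * relevance, capped at MAX_PER_KEYWORD.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)

# (weight, relevance) for the rows with non-zero relevance in the table above;
# all other rows have relevance 0 and contribute nothing under this rule.
nonzero_rows = [(0.0, 3.0), (0.0, 5.0), (0.0, 8.0), (0.0, 8.0), (0.0, 5.0)]

score = total_score(nonzero_rows)
threshold = passing_score(27.0)  # 27 keywords with a configured weight of 1.0 each
print(score, round(threshold, 1), score >= threshold)
```

Because every weight in this table is listed as 0.0, the total is 0.0 whatever the relevance values are. Under the same assumed rule, a weight of 1.0 per keyword would give 29.0 for these relevance values.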

!!! tip deepseek-chat TL;DR

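The paper studies the trade-off between adversarial robustness and clean-data accuracy in vision-language models. Its analysis finds that robustness is concentrated mainly in the shallow layers, and it proposes R-Adapt, a framework that freezes the pre-trained weights and applies minimal adaptations only in the initial layers, achieving an exceptional balance between robustness and accuracy across 18 datasets.

Abstract (translated)

Achieving adversarial robustness in vision-language models (VLMs) inevitably compromises their accuracy on clean data, a long-standing and challenging trade-off. This work revisits the trade-off by investigating a fundamental question: what makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms operate internally and how they interact with clean accuracy. Our analysis shows that adversarial robustness is not uniformly distributed across network depth. Instead, and unexpectedly, it is concentrated mainly in the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Based on these findings, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible ways to equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks show that our method achieves state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL), enhancing their robustness. Our project page is at https://summu77.github.io/R-Adapt.

Abstract (original)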

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

Keywords: Vision-Language Models, Adversarial Robustness, Clean Accuracy, Fine-tuning, Parameter-efficient Adaptation, Low-frequency Spectral Bias, Attention Patterns, R-Adapt Framework

187. ❌ Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Authors: Sangmin Kim, Minhyuk Hwang, Geonho Cha, Dongyoon Wee, Jaesik Park Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

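Scoring rationale: The paper belongs to computer vision and 3D reconstruction, specifically the joint reconstruction of scenes and multi-person human meshes from multi-view video. Although the abstract mentions "3D foundation models", these are 3D vision foundation models (such as Pi3X and Multi-HMR), not large language models (LLMs). The core techniques are geometric reconstruction, multi-view fusion, scale adjustment, and multi-person association. None of the keywords, which concern large language models, deep learning technical principles, or specific AI for Science applications (such as bioinformatics), apply. All keyword relevance scores are therefore 0.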

!!! tip deepseek-chat TL;DR

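The paper proposes CHROMM, a unified framework that jointly reconstructs scene point clouds and multi-person human meshes from multi-view video without external modules or preprocessing, achieving a speedup of more than 8x while maintaining competitive performance.

Abstract (translated)

Recent advances in 3D foundation models have sparked growing interest in reconstructing humans and their surrounding environments. However, most existing methods focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we propose CHROMM, a unified framework that jointly estimates camera parameters, scene point clouds, and human meshes directly from multi-person multi-view videos, without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture and introduce a scale adjustment module to resolve the scale discrepancy between humans and the scene. We also propose a multi-view fusion strategy that aggregates per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method that is more robust than appearance-based approaches. Experiments on the EMDB, RICH, EgoHumans, and EgoExo4D datasets show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running more than 8x faster than prior optimization-based multi-view methods. Project page: https://nstar1125.github.io/chromm.

Abstract (original)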

Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.

Keywords: 3D reconstruction, multi-view video, human mesh reconstruction, scene point cloud, multi-person association, scale adjustment, neural network, real-time performance

188. ❌ Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

Authors: Shuchang Lyu, Haiquan Wen, Guangliang Cheng, Meng Li, Zheng Zhou, You Zhou, Dingding Yao, Zhenwei Shi Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

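Scoring rationale: The paper proposes a multi-entity reasoning benchmark and framework for remote sensing visual grounding and is highly relevant to several large-model keywords: 1) the framework is built on visual-linguistic foundation models (8 points); 2) it uses supervised fine-tuning (SFT) for cold-start initialization (10 points); 3) it involves multi-step reasoning (Chain of Thought) and structured reasoning traces (10 points); 4) it emphasizes in-depth reasoning (System 2 Thinking) to handle multi-entity relations (8 points); 5) it is an AI for Science application in remote sensing (8 points). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.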

!!! tip deepseek-chat TL;DR

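To address the lack of multi-entity reasoning in remote sensing visual grounding, the paper proposes ME-RSRG, a new multi-entity reasoning benchmark dataset, and EAR, an entity-aware reasoning framework built on visual-linguistic foundation models, which significantly improves multi-entity reasoning performance through supervised fine-tuning and entity-aware reward optimization.

Abstract (translated)

Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly improved multi-step reasoning. This progress motivates extending reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, which limits the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an entity-aware reasoning framework built on visual-linguistic foundation models. The framework generates structured reasoning traces and subject-object grounding outputs. It uses supervised fine-tuning for cold-start initialization and is further optimized with Group Relative Policy Optimization driven by entity-aware rewards. Extensive experiments on ME-RSRG demonstrate the difficulty of multi-entity reasoning and verify the effectiveness of the proposed entity-aware reasoning framework. Our dataset, code, and models will be released at https://github.com/CV-ShuchangLyu/ME-RSRG.

Abstract (original)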

Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.

Keywords: Multi-Entity Reasoning, Remote Sensing Grounding, Visual-Linguistic Foundation Models, Supervised Fine-tuning, Reasoning Traces, Entity-Aware Reasoning, Group Relative Policy Optimization, Benchmark Dataset
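
The group-relative step in Group Relative Policy Optimization (GRPO) can be sketched as follows: several responses are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. The reward values below are hypothetical placeholders; the paper's entity-aware reward is not specified in the abstract and is not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one grounding query.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to provide a baseline.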

189. ❌ Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Authors: Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao, Cheng Chen, Dillan Imans, Yonghao Long, Yiru Ye, Yixiao Liu, Rongyun Mai, Kai Chen, Hongliang Ren, Yutong Ban, Guangsuo Wang, Francis Wong, Chi-Fai Ng, Kee Yuan Ngiam, Russell H. Taylor, Daguang Xu, Yueming Jin, Qi Dou Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

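Scoring rationale: The paper mainly studies the recognition of basic surgical actions and its application to surgical skill assessment and planning, which is an application of AI in medical science. Relevance to "Large Language Models OR LLMs OR Foundation Models" is 8 points, because the paper uses large vision-language models for surgical planning, although this is not the core method (the core is a foundation model for action recognition). Relevance to "AI for Science OR Bioinformatics OR Cheminformatics" is high (10 points), because the paper applies AI directly to surgery (a biomedical domain), which falls under AI for Science. Other keywords such as MoE, Scaling Laws, and RLHF are not covered and score 0.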

!!! tip deepseek-chat TL;DR

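By building the largest dataset of basic surgical actions and developing a foundation model, the paper achieves general-purpose recognition of surgical actions across specialties and applies it to surgical skill assessment and to surgical planning based on large vision-language models, advancing surgical intelligence.

Abstract (translated)

Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling basic surgical actions, the fundamental units of any surgical operation, is essential to advancing the field. This paper presents a dataset of basic surgical actions covering 6 surgical specialties and 10 basic action classes with more than 11,000 video clips, the largest to date. Based on this dataset, we developed a new foundation model capable of general-purpose recognition of basic actions. In validation experiments on diverse datasets spanning different procedure types and body parts, our method shows robust cross-specialty performance. We further demonstrate the potential of the foundation model through two downstream applications: surgical skill assessment in prostatectomy using domain expertise, and action planning in cholecystectomy and nephrectomy using large vision-language models. Surgeons from multiple countries evaluated the explainable action-planning texts generated by the language model, and the results confirmed their clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and that an accurate model for understanding basic surgical actions can facilitate the development of complex applications and accelerate the realization of surgical superintelligence.

Abstract (original)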

Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialist performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons' evaluation of the language model's output of the action planning explainable texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.

Keywords: surgical actions, foundation model, large language models, surgical skill assessment, vision-language models, surgical planning, cross-specialty recognition, AI for surgery

190. ❌ PVI: Plug-in Visual Injection for Vision-Language-Action Models

Authors: Zezhou Zhang, Songxin Zhang, Xiao Xiong, Junjie Zhang, Zejian Xie, Jingyi Xi, Zunyao Mao, Zan Mao, Zhixin Mai, Zhuoyang Song, Jiaxing Zhang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12772v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

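Scoring rationale: The paper studies visual injection in vision-language-action (VLA) models, at the intersection of computer vision and robot control. It is unrelated to most keywords because it does not address large language model (LLM) techniques, reasoning methods, alignment, compression, and so on. Keywords with a relevance of 5 points: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation": the paper uses a pretrained VLM and action expert, so pretrained models are involved; 2) "Post-training OR Supervised Fine-tuning OR SFT": the paper uses single-stage fine-tuning; 3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning": the PVI module is lightweight and injects features through zero-initialized residual pathways, which is similar in spirit to parameter-efficient fine-tuning. Other keywords such as AI for Science may be indirectly related (robotics applications), but the paper does not explicitly involve bioinformatics or cheminformatics, so they score 0.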

!!! tip deepseek-chat TL;DR

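The paper proposes Plug-in Visual Injection (PVI), a lightweight plug-in module that enriches vision-language-action models with spatio-temporal visual features and improves performance on multi-phase tasks in both simulated and real-robot experiments.

Abstract (translated)

Vision-language-action (VLA) architectures that pair a pretrained vision-language model (VLM) with a flow-matching action expert have become an important paradigm for language-conditioned manipulation. However, the VLM is typically optimized for semantic abstraction and conditioned mostly on static visual observations, which tends to weaken fine-grained geometric cues and leaves the action expert without explicit temporal evidence. Prior work mitigates this by injecting auxiliary visual features, but existing methods either focus on static spatial representations or require substantial architectural changes to accommodate temporal inputs, so temporal information remains underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations through zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. With this module we obtain consistent gains over the base policy and over a range of competitive alternative injection strategies. Our controlled study shows that temporal video features outperform strong static image features, with the largest gains on multi-phase tasks that require state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of the module beyond simulation.

Abstract (original)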

VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.

Keywords: Vision-Language-Action Models, Visual Injection, Temporal Features, Fine-tuning, Robot Manipulation, Multi-phase Tasks, Zero-initialized Residual, Bimanual Cloth Folding
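
Why a zero-initialized residual pathway preserves pretrained behavior can be shown with a minimal NumPy sketch. The class name `ZeroInitInjection`, the single linear projection, and the dimensions are illustrative assumptions and do not describe the actual PVI architecture.

```python
import numpy as np

class ZeroInitInjection:
    """Adds projected auxiliary features to a hidden state through a residual path."""

    def __init__(self, aux_dim, hidden_dim):
        # ASSUMPTION: one linear projection, initialized to zero. In a real
        # model this matrix would be a trainable parameter.
        self.proj = np.zeros((aux_dim, hidden_dim))

    def __call__(self, hidden, aux):
        # Residual injection: the pretrained hidden state passes through unchanged
        # and the auxiliary contribution is added on top.
        return hidden + aux @ self.proj

inject = ZeroInitInjection(aux_dim=4, hidden_dim=3)
hidden = np.array([[1.0, 2.0, 3.0]])  # stand-in for an action-expert activation
aux = np.ones((1, 4))                 # stand-in for auxiliary visual features

before_training = inject(hidden, aux)  # identical to `hidden` at initialization
inject.proj[:] = 0.1                   # stand-in for weights after fine-tuning
after_training = inject(hidden, aux)
print(before_training, after_training)
```

At initialization the module is an exact identity on the hidden state, so fine-tuning starts from the pretrained policy's behavior and only gradually mixes in the auxiliary features.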

191. ❌ Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Authors: Shifeng Chen, Yihui Li, Jun Liao, Hongyu Yang, Di Huang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12766v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

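Scoring rationale: The paper focuses on dynamic scene editing in computer vision and graphics, specifically an editing method for 4D Gaussian scenes (the Catalyst4D framework), involving anchor-based motion guidance and color uncertainty-guided appearance refinement. The scoring keywords all concern large language models, deep learning technical principles, and scientific applications of AI, whereas this work belongs to visual computing for 3D/4D scene reconstruction and editing. It has no direct connection to the large-model techniques, training methods, inference optimization, AI agents, or AI for Science topics in the keyword list, so all keyword relevance scores are 0.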

!!! tip deepseek-chat TL;DR

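The paper addresses motion artifacts, temporal flickering, and inconsistent style propagation in dynamic 4D scene editing. It proposes the Catalyst4D framework, which uses anchor-based motion guidance and color uncertainty-guided appearance refinement to achieve high-quality, spatio-temporally consistent dynamic scene editing.

Abstract (translated)

Recent 3D scene editing techniques based on neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) already enable high-quality editing of static scenes. Dynamic scene editing, by contrast, remains challenging: methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We propose Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal consistency. Its core component, Anchor-based Motion Guidance (AMG), builds a set of structurally stable and spatially representative anchors from the original and edited Gaussians. These anchors serve as robust region-level references, and correspondences between them are established via optimal transport, enabling consistent deformation propagation without cross-region interference or motion drift. The complementary Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating the color uncertainty of each Gaussian and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments show that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.

Abstract (original)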

Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.

Keywords: 4D scene editing, dynamic scene editing, 4D Gaussian scenes, temporal coherence, motion artifacts, Anchor-based Motion Guidance, Color Uncertainty-guided Appearance Refinement, spatial-temporal consistency
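
Establishing correspondences between two anchor sets via optimal transport can be sketched with entropy-regularized Sinkhorn iterations. The abstract only states that optimal transport is used; the Sinkhorn solver, the squared-distance cost, the uniform anchor masses, and the toy anchors below are assumptions for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularized transport plan between two uniformly weighted point sets."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each source anchor
    b = np.full(m, 1.0 / m)  # mass on each target anchor
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale so that column and row sums match the marginals.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

# Hypothetical anchors: the edited set is a shuffled, slightly shifted copy.
original = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edited = original[[2, 0, 1]] + 0.05

# Squared Euclidean distance between every original and edited anchor.
cost = ((original[:, None, :] - edited[None, :, :]) ** 2).sum(axis=-1)
plan = sinkhorn_plan(cost)
matches = plan.argmax(axis=1)  # most likely edited anchor for each original anchor
print(matches)
```

Each row of the plan spreads one anchor's mass over the candidate matches, so taking the row-wise argmax gives a hard correspondence, while the soft plan itself could be used to blend deformations across nearby anchors.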

192. ❌ TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation

Authors: Nazar Puriy, Johannes Jakubik, Benedikt Blumenstiel, Konrad Schindler Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

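Scoring rationale: TerraFlow focuses on multimodal, multitemporal representation learning for Earth observation, which is an AI for Science application and highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). However, the paper does not involve large language models (LLMs) or innovations in deep learning technical principles (such as MoE, Scaling Laws, training methods, inference optimization, or agent systems), so all other keywords score 0.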

!!! tip deepseek-chat TL;DR

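The paper proposes TerraFlow, which uses multimodal, multitemporal representation learning to model spatio-temporal sequences in Earth observation data. It outperforms existing foundation models on the GEO-Bench-2 benchmark and achieves significant improvements on natural-disaster risk map prediction.

Abstract (translated)

We propose TerraFlow, a new approach to multimodal, multitemporal learning for Earth observation. Built on temporal training objectives, it enables sequence-aware learning across space, time, and modality while remaining robust to the variable-length inputs common in real-world Earth observation data. Experiments show that TerraFlow outperforms current state-of-the-art Earth observation foundation models on all temporal tasks of the GEO-Bench-2 benchmark. We further show that TerraFlow takes initial steps towards deep-learning-based risk map prediction for natural disasters, a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms the best existing foundation models by up to 50% in F1 score and by up to 24% in Brier score.

Abstract (original)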

We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning based risk map prediction for natural disasters – a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.

Keywords: Earth observation, multimodal learning, multitemporal learning, representation learning, foundation models, risk map prediction, temporal tasks, GEO-Bench-2
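
The two metrics quoted in the abstract can be computed as in the following sketch for a binary risk label. The labels, probabilities, and the 0.5 decision threshold are hypothetical; the paper's exact evaluation protocol is not given in the abstract.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

# Hypothetical per-pixel risk labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in probs]  # assumed threshold of 0.5

f1 = f1_score(y_true, y_pred)
brier = brier_score(y_true, probs)
print(f1, brier)
```

A higher F1 score is better, while a lower Brier score is better, so an improvement in Brier score corresponds to a reduction of the value.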

193. ❌ SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Authors: Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, Hongliang Li Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12764v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

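Scoring rationale: SAVA-X addresses a cross-view video analysis task in computer vision (Ego→Exo imitation error detection) and proposes a framework built on adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion. The scoring keywords all target large models, deep learning techniques, or AI for Science, whereas this paper studies a specific video-processing task and does not touch on large-model techniques, training methods, inference optimization, agent systems, or scientific AI applications, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper tackles cross-view error detection when a third-person (exo) demonstration is used to assess a first-person (ego) imitation. It proposes the SAVA-X framework, which consistently improves AUPRC and mean tIoU over all baselines on the EgoMe benchmark.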

摘要翻译

错误检测在工业培训、医疗保健和装配质量控制中至关重要。现有研究大多基于单视角设定，无法处理使用第三人称（外视角）演示来评估第一人称（内视角）模仿的实际场景。本文正式提出“内视角→外视角模仿错误检测”任务：给定异步、长度不匹配的内视角与外视角视频，模型必须在内视角时间轴上定位操作步骤，并判断每一步是否存在错误。该设定引入了跨视角域偏移、时间错位和高度冗余性问题。在统一实验框架下，我们适配了密集视频描述与时序动作检测领域的强基线方法，发现它们在跨视角场景中表现不佳。为此，我们提出SAVA-X模型，采用“对齐-融合-检测”框架，包含三大核心组件：（1）视角条件自适应采样，（2）场景自适应视角嵌入，（3）双向交叉注意力融合机制。在EgoMe基准测试中，SAVA-X在AUPRC与平均时序交并比（mean tIoU）指标上均持续超越所有基线方法，消融实验也验证了各模块的互补优势。代码已开源：https://github.com/jack1ee/SAVAX。

摘要 (Abstract)

Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
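The mean tIoU metric named in the abstract measures how well predicted step boundaries overlap the annotated ones on the ego timeline. The sketch below shows the standard temporal IoU of two intervals. Pairing predictions with ground truth by index is an assumption made for illustration; the abstract does not describe the paper's matching protocol, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_tiou(preds, gts):
    """Average tIoU over segments paired by index (assumed pairing)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical step segments on the ego timeline.
predicted = [(2.0, 6.0), (10.0, 15.0)]
annotated = [(4.0, 8.0), (10.0, 15.0)]

score = mean_tiou(predicted, annotated)
```

The first pair overlaps for 2 s out of a 6 s union and the second pair matches exactly, so the mean lands between the two per-segment values.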

Keywords: Ego-to-Exo Imitation Error Detection, Cross-view Video Analysis, View-conditioned Adaptive Sampling, Scene-adaptive View Embeddings, Bidirectional Cross-attention Fusion, Temporal Misalignment, Procedural Step Localization, Video Understanding

194. ❌ HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

Authors: Xiaoyu Li, Yuhang Liu, Zheng Luo, Xuanshuo Kang, Fangqi Lou, Xiaohua Wu, Zihan Xiong Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12760v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

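Scoring rationale: The paper's core topic is In-Context Learning (ICL) in Large Multimodal Models, which is highly relevant to the keyword “In-context Learning OR Many-shot Learning” (15 points). It proposes HIFICL, which realizes context-aware parameter-efficient fine-tuning through virtual key-value pairs and a low-rank factorization, making it highly relevant to “PEFT OR LoRA OR Parameter-efficient Fine-tuning” (10 points). Because Large Multimodal Models (LMMs) fall under large models, the paper is also related to “Large Language Models OR LLMs OR Foundation Models” (8 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are not covered in the paper and are scored 0.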
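The score table for this paper can be read with a small worked calculation. The pass-mark formula (5.0 + 0.8 × total weight) and the 15-point per-keyword cap come from the report's configuration. The per-keyword rule `weight × relevance` is an assumption, since the report does not state it, and the function names are illustrative.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration


def pass_mark(total_weight):
    """Pass mark as given in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """ASSUMED rule: weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


# (weight, relevance) for the three rows of this paper's table with non-zero relevance.
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 15.0)]

total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords, configured weight 1.0 each
passed = total >= threshold
```

Under this assumed rule, the 0.0 weights listed in the table zero out every row regardless of relevance, which is consistent with the reported total of 0.0 against a pass mark of 26.6.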

!!! tip deepseek-chat TL;DR

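To address the sensitivity and high computational cost of In-Context Learning in large multimodal models, the paper proposes High-Fidelity In-Context Learning (HIFICL). It models the ICL mechanism more faithfully through virtual key-value pairs and a low-rank factorization, and outperforms existing approximation methods on several multimodal benchmarks.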

摘要翻译

上下文学习（In-Context Learning, ICL）是大规模多模态模型（Large Multimodal Models, LMMs）的重要范式，它通过少量上下文示例（in-context demonstrations, ICDs）来实现新任务适应。然而，其性能对示例配置敏感且计算成本高昂。从数学角度分析，这些示例的影响可分解为标准注意力输出与上下文值的动态混合。现有近似方法通过学习一个“偏移向量”来简化这一过程。受精确分解的启发，我们提出了高保真上下文学习（High-Fidelity In-Context Learning, HiFICL），以更真实地建模ICL机制。HiFICL包含三个核心组件：1）一组作为可学习上下文的“虚拟键值对”；2）用于稳定和正则化训练的低秩分解；3）一个简单的端到端训练目标。从另一视角看，该机制构成了一种上下文感知的参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）形式。大量实验表明，HiFICL在多个多模态基准测试中均稳定优于现有近似方法。代码发布于https://github.com/bbbandari/HiFICL。

摘要 (Abstract)

In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a “shift vector”. Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of “virtual key-value pairs” to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at https://github.com/bbbandari/HiFICL.
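The decomposition mentioned in the abstract can be checked numerically. The sketch below is a generic illustration of the identity, not the authors' code. For a single query, softmax attention over concatenated context and query-side keys equals a convex mixture of the two separate attention outputs, with the mixing weight set by their softmax normalizers. All values are toy numbers, the 1/√d scaling is omitted for brevity, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Softmax attention for one query; also returns the softmax normalizer Z."""
    weights = [math.exp(dot(q, k)) for k in keys]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) / z for i in range(dim)]
    return out, z


# Toy query, in-context demonstration (context) tokens, and the query's own tokens.
q = [0.5, -1.0]
ctx_k = [[1.0, 0.0], [0.0, 1.0]]
ctx_v = [[1.0, 2.0], [3.0, 4.0]]
own_k = [[0.5, 0.5], [-1.0, 1.0], [0.2, -0.3]]
own_v = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

# Attention over the concatenated sequence.
full, _ = attend(q, ctx_k + own_k, ctx_v + own_v)

# Separate attentions plus a query-dependent ("dynamic") mixing weight.
ctx_out, z_ctx = attend(q, ctx_k, ctx_v)
own_out, z_own = attend(q, own_k, own_v)
mu = z_ctx / (z_ctx + z_own)
mixed = [mu * c + (1 - mu) * o for c, o in zip(ctx_out, own_out)]
```

Because `mu` depends on the query, a single fixed shift vector cannot reproduce this mixture exactly, which is the gap the abstract says HIFICL aims to narrow with learnable virtual key-value pairs.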

Keywords: In-Context Learning, Large Multimodal Models, Parameter-Efficient Fine-Tuning, Virtual Key-Value Pairs, Low-rank Factorization, Multimodal Tasks, Context-aware, End-to-end Training

195. ❌ SAP: Segment Any 4K Panorama

Authors: Lutao Jiang, Zidong Cao, Weikai Chen, Xu Zheng, Yuanhuiyi Lyu, Zhenyang Li, Zeyu HU, Yingda Yin, Keyang Luo, Runze Zhang, Kai Yan, Shengju Qian, Haidi Fan, Yifan Peng, Xin Wang, Hui Xiong, Ying-Cong Chen Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

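Scoring rationale: SAP addresses panoramic image instance segmentation in computer vision, proposing a trajectory-aligned paradigm to improve panoramic segmentation and building a large-scale synthetic dataset for training. Although the paper invokes the foundation model concept, the scoring keywords all refer specifically to large language models (LLMs) and related techniques (MoE, RLHF, RAG, and so on), whereas this paper studies a vision segmentation model with no direct connection to language-model techniques. It does not cover any language-model architecture, training method, inference optimization, or application scenario, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

To address the performance drop of foundation segmentation models on panoramic images, the paper proposes SAP, a trajectory-aligned paradigm that recasts panoramic segmentation as fixed-trajectory perspective video segmentation and trains on synthetic data. On a real-world 4K panorama benchmark it achieves a +17.2 zero-shot mIoU gain over vanilla SAM2.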

摘要翻译

可提示实例分割在具身智能与增强现实系统中得到广泛应用，但基于透视图像训练的基础模型在360°全景图像上性能常出现显著下降。本文提出Segment Any 4K Panorama（SAP）——一种面向4K高分辨率全景实例级分割的基础模型。我们将全景分割重新定义为固定轨迹的透视视频分割，通过沿连续球面轨迹采样重叠透视图像块来分解全景图。这种内存对齐的重构方法在保持原生4K分辨率的同时，恢复了稳定跨视图传播所需的平滑视点过渡。为实现大规模监督训练，我们利用InfiniGen引擎合成了183,440张带实例分割标签的4K分辨率全景图像。在此轨迹对齐范式下训练的SAP模型能有效泛化至真实世界360°图像，在真实世界4K全景基准测试中，相比不同尺寸的原始SAM2模型实现了+17.2的零样本平均交并比提升。

摘要 (Abstract)

Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving +17.2 zero-shot mIoU gain over vanilla SAM2 of different sizes on real-world 4K panorama benchmark.

Keywords: panoramic segmentation, instance segmentation, foundation model, 4K resolution, trajectory-aligned, zero-shot, synthetic data, InfiniGen

196. ❌ Show, Don’t Tell: Detecting Novel Objects by Watching Human Videos

Authors: James Akl, Jose Nicolas Avendano Arbelaez, James Barabas, Jennifer L. Barry, Kalie Ching, Noam Eshed, Jiahui Fu, Michel Hidalgo, Andrew Hoelscher, Tushar Kusnur, Andrew Messing, Zachary Nagler, Brian Okorn, Mauro Passerino, Tim J. Perkins, Eric Rosen, Ankit Shah, Tanmay Shankar, Scott Shaw Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12751v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

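Scoring rationale: The paper studies a system that lets robots detect and recognize novel objects by watching human demonstration videos, using self-supervised learning and automatic dataset creation. The scoring keywords all target large models, deep learning techniques, or applications of AI in science, whereas this paper focuses on object detection in computer vision and robotics and does not involve any large-model technique, training method, inference optimization, or concrete AI for Science application, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper presents “Show, Don’t Tell”, a self-supervised system that automatically builds a dataset from human demonstration videos and trains a bespoke object detector on it. This lets a robot quickly recognize novel objects and improves its task completion.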

摘要翻译

机器人如何在人类演示过程中快速识别并认知新出现的物体？现有的闭集物体检测器常在此失效，因为这些物体处于分布范围之外。尽管开集检测器（如视觉语言模型）有时能够成功，但它们通常需要昂贵且繁琐的人工参与式提示工程来唯一识别新物体实例。本文提出一种自监督系统，通过在自动创建的数据集上训练定制化物体检测器，并以人类演示本身作为监督信号，从而消除对繁琐语言描述和昂贵提示工程的需求。在我们的“展示而非告知”方法中，我们在演示过程中向检测器直接展示感兴趣的具体物体，而非通过复杂语言描述向检测器传达这些物体的信息。通过完全绕过语言环节，该范式使我们能够快速训练出针对人类任务演示中观察到的相关物体而定制的检测器。我们开发了一套集成式机器人系统，将这种自动数据集创建与新物体检测的“展示而非告知”范式部署于真实机器人平台。实验结果表明，我们的流程在操作物体的检测与识别性能上显著优于现有先进方法，从而提升了机器人的任务完成能力。

摘要 (Abstract)

How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, “Show, Don’t Tell,” we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our “Show, Don’t Tell” paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.

Keywords: novel object detection, self-supervised learning, human demonstration, automatic dataset creation, robot vision, open-set detection, on-robot system, task completion

197. ❌ SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Authors: Zheng Gao, Yifan Yang, Xiaoyu Li, Xiaoyan Feng, Haoran Fan, Yang Song, Jiaojiao Jiang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

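Scoring rationale: The paper studies image watermarking for diffusion models, focusing on the robustness of semantic-aware watermarks and on tamper localization, which places it in computer vision and image security. It does not involve large language models, innovations in deep learning techniques, or scientific applications, and has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

To address the vulnerability of diffusion-model image watermarks to semantic editing attacks, the paper proposes the SLICE framework. SLICE decouples image semantics into factors anchored to distinct regions of the initial noise, enabling robust semantic-aware watermarking and localizable tamper detection.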

摘要翻译

扩散模型初始噪声水印已成为图像溯源的一种前景广阔的方法，但内容无关的噪声模式可能通过反演与再生攻击被伪造。近期基于语义感知的水印方法通过将验证过程与图像语义绑定，提升了鲁棒性。然而，这些方法依赖于单一的全局语义绑定，使其易受局部但全局连贯的语义编辑攻击。为克服此局限并提供可信的语义感知水印，我们提出基于分区嵌入的语义潜变量注入方法（SLICE）。本框架将图像语义解耦为四个语义因子（主体、环境、动作与细节），并将其精确锚定至初始高斯噪声的不同区域。这种细粒度的语义绑定支持高级水印验证，使得语义篡改可被检测与定位。我们从理论上论证了SLICE为何能实现鲁棒且可靠的篡改定位，并为误接受率提供了统计保证。实验结果表明，在面对先进的语义引导再生攻击时，SLICE显著优于现有基线方法，在保持图像质量与语义保真度的同时大幅降低了攻击成功率。总体而言，SLICE提供了一种无需重新训练的实用溯源方案，既具备细粒度诊断能力，又能有效抵御现实中的对抗性篡改操作。

摘要 (Abstract)

Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.

Keywords: image watermarking, diffusion models, semantic-aware watermark, tamper localization, semantic binding, initial noise, robust watermark, provenance

198. ❌ Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Authors: Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

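Scoring rationale: The paper's core topic is how Multimodal Large Language Models (MLLMs) perceive, track, and reason in a dynamic 4D world, which is highly relevant to “Large Language Models” (10 points). It touches on concepts related to “Chain of Thought” and “System 2 Thinking”, strengthening reasoning through structured methods such as Mask-Guided Fusion and ST-TCM, but does not go into specific technical detail, hence 8 points each. Other keywords (MoE, SLMs, Scaling Laws, training methods, optimization techniques, agent systems, and so on) are either not covered or only mentioned in passing and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how multimodal large language models perceive, track, and reason in the physical 4D dynamic world. It finds that existing models cannot maintain strong performance on both spatio-temporal reasoning and dynamic object grounding, and shows that structured integration methods (Mask-Guided Fusion and ST-TCM) significantly improve dynamics perception and spatio-temporal reasoning.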

摘要翻译

人类栖居于一个物理的四维世界，其中几何结构与语义内容随时间演变，构成了动态的四维现实（空间维度叠加时间维度）。尽管当前的多模态大语言模型（MLLMs）在静态视觉理解方面表现出色，但它们是否也能擅长“动态思维”，即感知、追踪并推理演化场景中的时空动态？为系统评估其时空推理与局部动态感知能力，我们引入了Dyn-Bench——一个基于多样化的真实世界与合成视频数据集构建的大规模基准，旨在实现对时空理解的稳健且可扩展的评估。通过对海量二维与四维数据源进行多阶段筛选，Dyn-Bench提供了高质量动态场景集合，包含1千个视频、7千个视觉问答（VQA）对以及3千个动态物体定位对。我们探究了通用型、空间型及区域级MLLMs如何以语言和视觉形式表达其动态思维过程，发现现有模型无法在时空推理与动态物体定位任务中同时保持强劲性能，且常对运动与交互产生不一致的解读。值得注意的是，传统提示策略（如思维链或基于描述的提示）带来的改进有限，而结构化整合方法——包括掩码引导融合（Mask-Guided Fusion）与时空文本认知地图（Spatio-Temporal Textual Cognitive Map, ST-TCM）——能显著增强MLLMs在物理四维世界中的动态感知与时空推理能力。代码与基准数据集发布于https://dyn-bench.github.io/。

摘要 (Abstract)

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at “thinking in dynamics”, i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs’ dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.

Keywords: Multimodal Large Language Models, Spatio-temporal Reasoning, Dynamic Object Grounding, 4D World, Dyn-Bench Benchmark, Mask-Guided Fusion, Spatio-Temporal Textual Cognitive Map, Visual Question Answering

199. ❌ The COTe score: A decomposable framework for evaluating Document Layout Analysis models

Authors: Jonathan Bourne, Mwiza Simbeye, Ishtar Govia Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

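Scoring rationale: The paper focuses on evaluation metrics for Document Layout Analysis (DLA), proposing the COTe scoring framework and the SSU labelling method. Its content covers computer vision, document understanding, and evaluation methodology, but does not touch the keyword areas of large language models, deep learning techniques, or AI for Science. The keywords all concern large-model technology, training methods, inference optimization, alignment, or scientific AI applications, whereas this paper studies evaluation metrics for document image analysis, a traditional computer vision task with no direct connection to the given keywords.

!!! tip deepseek-chat TL;DR

Traditional metrics such as IoU and F1 are ill-suited to natively 2D printed media when evaluating document layout analysis models. The paper proposes the COTe score, built on Structural Semantic Units, and shows through case studies and model evaluation that it reveals model failure modes more effectively and narrows the interpretation-performance gap.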

摘要翻译

文档布局分析（Document Layout Analysis, DLA）是指将页面解析为有意义元素的过程，通常借助机器学习模型实现。目前，模型质量普遍采用通用目标检测指标（如交并比IoU、F1值或平均精度均值mAP）进行评估。然而，这些指标原本是为三维空间的二维投影图像设计的，并不适用于印刷媒介这种原生二维图像。这种差异可能导致指标对模型性能产生误导性或无信息量的解读。为促进更稳健、可比较且精细化的文档布局分析，我们提出：结构语义单元（Structural Semantic Unit, SSU）——一种将关注点从内容物理结构转向语义结构的关系标注方法；以及覆盖率、重叠度、侵入度与冗余度（Coverage, Overlap, Trespass, and Excess, COTe）评分——一种用于衡量页面解析质量的可分解指标。我们通过案例研究，并在三个DLA数据集上评估五种常见DLA模型，验证了这些方法的实用价值。实验表明，COTe评分比传统指标更具信息量，能揭示模型间不同的失效模式（例如突破语义边界或重复解析同一区域）。此外，相较于F1值，COTe评分将解读与性能的差距最高降低了76%。值得注意的是，即使没有显式的SSU标注，COTe评分的粒度鲁棒性仍基本保持，这降低了使用该系统的入门门槛。最后，我们发布了SSU标注数据集及用于DLA项目的COTe评分Python工具库。

摘要 (Abstract)

Document Layout Analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content, and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to F1. Notably, we find that the COTe score’s granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU-labelled dataset and a Python library for applying COTe in DLA projects.

Keywords: Document Layout Analysis, Evaluation Metrics, COTe Score, Structural Semantic Unit, Page Parsing, Model Performance, Semantic Boundaries, DLA Datasets

200. ❌ UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC

Authors: Jillur Rahman Saurav, Thuong Le Hoai Pham, Pritam Mukherjee, Paul Yi, Brent A. Orr, Jacob M. Luber Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
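
As a minimal sketch of how the verdict for this entry can be reproduced: the report configuration gives the pass threshold as 5.0 + 0.8 × total weight (total weight 27.0, threshold 26.6), and the total is assumed here to be the sum of the Score column. The function names and the row tuples are illustrative assumptions, not the report generator's actual code.

```python
# Hypothetical helper names; only the threshold formula and the numbers come from the report.
BASE = 5.0
PER_WEIGHT = 0.8

def pass_threshold(total_weight: float) -> float:
    """Pass threshold as stated in the report configuration."""
    return BASE + PER_WEIGHT * total_weight

def total_score(rows):
    """Sum the Score column; each row is (keyword, weight, relevance, score)."""
    return sum(score for _keyword, _weight, _relevance, score in rows)

# The three rows with non-zero relevance from the table; the other 24 rows are all zeros.
rows = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0, 0.0),
    ("Pre-training OR Continual Pre-training OR Domain Adaptation", 0.0, 5.0, 0.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0, 0.0),
]

threshold = pass_threshold(27.0)
total = total_score(rows)
passed = total >= threshold
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Because every row in the table carries a weight of 0.0, the summed score stays at 0.0 regardless of the relevance values, which is why the entry fails.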

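Scoring rationale: UNIStainNet centers on using a pathology foundation model (UNI) to guide virtual staining from H&E to IHC, an applied AI contribution in biomedicine (pathology). It is therefore highly relevant to 'Foundation Models' (10 points), since the paper explicitly uses a frozen pathology foundation model (UNI) for semantic guidance; highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), since it is a concrete application of AI to bioinformatics and pathology diagnostics; and moderately relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), since it involves domain adaptation (translating from the H&E domain to the IHC domain) without discussing pre-training techniques in depth. The remaining keywords (MoE, SFT, RAG, and so on) are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes UNIStainNet, a virtual staining method guided by a pathology foundation model (UNI) that generates IHC stains from H&E images, achieves state-of-the-art distributional metrics across multiple biomarkers, and handles several stains with a single model.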

Abstract

Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (H&E) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE-UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue-level semantic guidance for stain translation. A misalignment-aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state-of-the-art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per-stain models. On BCI, it also achieves the best distributional metrics. A tissue-type stratified failure analysis reveals that remaining errors are systematic, concentrating in non-tumor tissue. Code is available at https://github.com/facevoid/UNIStainNet.

Keywords: virtual immunohistochemistry staining, pathology foundation model, H&E to IHC translation, UNIStainNet, semantic guidance, multiple IHC markers, state-of-the-art performance, tissue-level analysis

201. ❌ Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

Authors: Jing Yang, Hui Xue, Shipeng Zhu, Pengfei Fang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

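Scoring rationale: The paper studies unsupervised cross-domain image retrieval (UCDIR) and proposes TPSNet, a method that combines a text prior with a phase prior. It belongs mainly to computer vision, cross-domain learning, and image retrieval, and is unrelated to most of the keywords on large models and deep learning techniques. The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper addresses domain adaptation across domains, though not pre-training or continual pre-training of large models, so it receives 5 points (moderately relevant). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

To address inaccurate pseudo-labels and semantic degradation in unsupervised cross-domain image retrieval, the paper proposes TPSNet, which combines a text prior with a phase prior and significantly outperforms existing methods on UCDIR benchmarks.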

Abstract

This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed as domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

Keywords: unsupervised cross-domain image retrieval, UCDIR, text-phase synergy network, TPSNet, CLIP, domain prompt, phase prior, domain-invariant features
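
To illustrate why Fourier phase is a plausible domain-invariant cue, the sketch below shows that rescaling intensity and adding a constant offset (a crude stand-in for a domain shift) leaves the phase of every non-DC frequency bin unchanged. This is a toy example on a 1D signal with a hand-written DFT, not TPSNet's actual phase prior; `dft` and `unit_phasors` are hypothetical helper names, and the example signal is chosen so that no bin has zero magnitude.

```python
import cmath

def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def unit_phasors(signal):
    """Phase of each non-DC bin as a unit complex number (amplitude discarded)."""
    return [c / abs(c) for c in dft(signal)[1:]]

source = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 2.0, 2.0]
# Toy "domain shift": change contrast (scale) and brightness (offset).
shifted = [2.5 * x + 10.0 for x in source]

max_gap = max(abs(a - b) for a, b in zip(unit_phasors(source), unit_phasors(shifted)))
print(f"largest phase difference: {max_gap:.2e}")
```

Phases are compared as unit complex numbers rather than angles so that a bin sitting near ±π does not produce a spurious 2π jump.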

202. ❌ Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging

Authors: Muhammad Ahmed Khan, Manqiang Peng, Ding Lin, Saif Ur Rehman Khan Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

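Scoring rationale: The paper applies deep learning (convolutional neural networks and transformer attention) to medical image analysis (scleral vessel imaging) for blood glucose estimation and diabetes classification, an application of AI in biomedicine. Its content is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It is only moderately related to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it is applied AI research in bioinformatics and medicine, but that is not the paper's core contribution (which is the design of a specific deep learning framework), so it receives 5 points (moderately relevant).

!!! tip deepseek-chat TL;DR

The study proposes ScleraGluNet, a multiview deep learning framework that analyzes multidirectional scleral vessel images to classify glycemic status (normal, controlled diabetes, high-glucose diabetes) and to estimate continuous fasting plasma glucose noninvasively, reporting 93.8% overall accuracy and strong correlation with laboratory measurements (r = 0.983).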

Abstract

Regular monitoring of glycemic status is essential for diabetes management, yet conventional blood-based testing can be burdensome for frequent assessment. The sclera contains superficial microvasculature that may exhibit diabetes-related alterations and is readily visible on the ocular surface. We propose ScleraGluNet, a multiview deep-learning framework for three-class metabolic status classification (normal, controlled diabetes, and high-glucose diabetes) and continuous fasting plasma glucose (FPG) estimation from multidirectional scleral vessel images. The dataset comprised 445 participants (150/140/155) and 2,225 anterior-segment images acquired from five gaze directions per participant. After vascular enhancement, features were extracted using parallel convolutional branches, refined with Manta Ray Foraging Optimization (MRFO), and fused via transformer-based cross-view attention. Performance was evaluated using subject-wise five-fold cross-validation, with all images from each participant assigned to the same fold. ScleraGluNet achieved 93.8% overall accuracy, with one-vs-rest AUCs of 0.971, 0.956, and 0.982 for normal, controlled diabetes, and high-glucose diabetes, respectively. For FPG estimation, the model achieved MAE = 6.42 mg/dL and RMSE = 7.91 mg/dL, with strong correlation to laboratory measurements (r = 0.983; R² = 0.966). Bland-Altman analysis showed a mean bias of +1.45 mg/dL with 95% limits of agreement from -8.33 to +11.23 mg/dL. These results support multidirectional scleral vessel imaging with multiview learning as a promising noninvasive approach for glycemic assessment, warranting multicenter validation before clinical deployment.

Keywords: deep learning, blood glucose estimation, scleral vessel imaging, diabetes classification, noninvasive monitoring, multiview learning, transformer attention, medical image analysis
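
Bland-Altman limits of agreement are conventionally computed as the mean of the paired differences plus or minus 1.96 times their standard deviation. A minimal sketch of that construction follows, using made-up paired readings purely for illustration (the study data are not reproduced here, and `bland_altman` is a hypothetical helper name):

```python
from statistics import mean, stdev

def bland_altman(predicted, reference):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD of the differences
    return bias, bias - half_width, bias + half_width

# Made-up fasting glucose values in mg/dL (illustrative only).
reference = [92.0, 110.0, 135.0, 160.0, 188.0]
predicted = [95.0, 108.0, 139.0, 158.0, 193.0]

bias, lower, upper = bland_altman(predicted, reference)
print(f"bias={bias:+.2f} mg/dL, limits=({lower:+.2f}, {upper:+.2f})")
```

Read in reverse against the reported numbers, and assuming the conventional 1.96 multiplier was used, the limits -8.33 and +11.23 are centered on +1.45 with a half-width of 9.78, which corresponds to a standard deviation of the differences of about 4.99 mg/dL.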

203. ❌ HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation

Authors: Pingping Zhang, Tianyu Yan, Yuhao Wang, Yang Liu, Tongdan Tang, Yili Ma, Long Lv, Feng Tian, Weibing Sun, Huchuan Lu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12708v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

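Scoring rationale: The paper addresses marine animal segmentation in computer vision and proposes HFP-SAM, an improved framework built on the Segment Anything Model (SAM) that raises segmentation performance with a Frequency Guided Adapter, Frequency-aware Point Selection, and a Full-View Mamba module. The keywords all concern large language models, deep learning principles, or AI for Science, and the paper is entirely unrelated to most of them (LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, Agents, and so on). It is only moderately related to the last keyword, 'AI for Science', because the work applies AI to marine science and bioinformatics, but it is not core bioinformatics or cheminformatics research, so it receives 5 points (moderately relevant).

!!! tip deepseek-chat TL;DR

To address the challenges of long-distance modeling and fine-detail perception in marine animal segmentation, the paper proposes the HFP-SAM framework, which improves segmentation with a Frequency Guided Adapter, Frequency-aware Point Selection, and a Full-View Mamba module, and is validated on four public datasets.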

Abstract

Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most of previous deep learning-based MAS methods struggle with the long-distance modeling issue. Recently, Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks of perceiving fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM) for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts and integrate into SAM’s decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. The source code is publicly available at https://github.com/Drchip61/TIP-HFP-SAM.

Keywords: Marine Animal Segmentation, Segment Anything Model, Frequency Domain, Hierarchical Frequency Prompt, Full-View Mamba, Computer Vision, Image Segmentation, Deep Learning

204. ❌ VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Authors: Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12703v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

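Scoring rationale: The paper is a benchmark for video understanding. It proposes VCBench to evaluate how well video-language models maintain spatial-temporal state, with tasks such as object counting, event counting, and trajectory consistency. The scoring keywords all relate directly to large-model principles, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications, none of which the paper covers, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the VCBench benchmark to diagnose how well video understanding models maintain spatial-temporal state in long videos, and finds that current models show significant deficiencies on tasks such as object and event counting.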

Abstract

Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.

Keywords: video understanding, state maintenance, streaming counting, benchmark evaluation, spatial-temporal analysis, object counting, event counting, video-language models
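
The two object-counting modes the abstract distinguishes (currently visible objects versus cumulative unique identities) can be sketched as a streaming counter. This is an illustrative toy, assuming each frame arrives as a set of tracked object IDs; the class name and the input format are hypothetical and not part of VCBench.

```python
class StreamingObjectCounter:
    """Maintain both counts while frames arrive one at a time."""

    def __init__(self):
        self.visible = set()   # identities in the latest frame
        self.seen = set()      # every identity observed so far

    def update(self, frame_ids):
        self.visible = set(frame_ids)
        self.seen |= self.visible

    def query(self):
        """Answer a query point with (currently visible, cumulative unique)."""
        return len(self.visible), len(self.seen)

counter = StreamingObjectCounter()
frames = [{"car_1"}, {"car_1", "car_2"}, {"car_2"}, {"car_3"}, {"car_1", "car_3"}]
answers = []
for ids in frames:
    counter.update(ids)
    answers.append(counter.query())
print(answers)
```

The two counts diverge as soon as an object leaves the view, and a re-entering object (`car_1` in the last frame) raises the visible count without raising the cumulative one, which is the kind of state a model has to carry across the stream.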

Authors: Pingcong Li, Zihui Yu, Bichi Zhang, Sören Schwertfeger Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12696v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

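Scoring rationale: The paper studies HaltNav, a hierarchical navigation framework for vision-and-language navigation (VLN), covering the use of a multimodal large language model (MLLM) in navigation, planning over a topological map, local anomaly detection, and replanning. Although the paper uses an MLLM for task understanding and anomaly detection, the keywords all target specific techniques for text-only large language models (principles, training methods, inference optimization, alignment, agent systems), whereas the paper's core is a VLN system in which the MLLM is only one module for high-level task understanding and anomaly detection, with no in-depth technical innovation on the LLM itself. The paper does not touch any of the specific LLM techniques named in the keywords (MoE, Scaling Laws, RLHF, LoRA, RAG, CoT, quantization, and so on), nor any specific AI for Science application area (such as bioinformatics or cheminformatics). Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes HaltNav, a hierarchical navigation framework that combines a lightweight topological map (osmAG) with the global planning and local anomaly detection capabilities of a multimodal large language model (MLLM) to make vision-and-language navigation robust to environmental changes; experiments show significantly better performance on long-horizon navigation tasks.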

Abstract

Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.

Keywords: Vision-Language Navigation, Hierarchical Navigation Framework, Topological Priors, MLLM-based Brain Module, Reactive Visual Halting, Global Planning, Local Anomaly Detection, Robust Navigation
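
The halt, invalidate, and replan loop described in the abstract can be sketched on a toy area graph. This is a minimal illustration under stated assumptions: the graph, the area names, and the functions `plan` and `invalidate` are hypothetical, breadth-first search stands in for whatever planner HaltNav actually uses, and the halting decision (made by the MLLM in the paper) is simply given as input.

```python
from collections import deque

def plan(graph, start, goal):
    """Shortest path by hop count with breadth-first search; None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def invalidate(graph, a, b):
    """Remove an undirected passage after a halt (for example, a closed door)."""
    graph[a].discard(b)
    graph[b].discard(a)

# Toy floorplan-level topology: areas connected by passages.
graph = {
    "lobby": {"corridor_a", "corridor_b"},
    "corridor_a": {"lobby", "lab"},
    "corridor_b": {"lobby", "storage"},
    "storage": {"corridor_b", "lab"},
    "lab": {"corridor_a", "storage"},
}

first_route = plan(graph, "lobby", "lab")
invalidate(graph, "corridor_a", "lab")      # halting signal: passage blocked
detour = plan(graph, "lobby", "lab")
print(first_route, detour)
```

Keeping the map as a small adjacency structure is what makes invalidation cheap here: a blocked passage is a single edge removal, after which the same planner is rerun unchanged.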

Authors: Liangzheng Sun, Mengfan He, Xingyu Shao, Binbin Li, Zhiqiang Yan, Chunyu Li, Ziyang Meng, Fei Xing Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12690v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

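Scoring rationale: The paper focuses on cross-modal feature matching in computer vision, specifically infrared-visible image matching, and benchmarks both traditional and deep learning methods. None of the keywords, which cover large language models, training techniques, reasoning methods, alignment, agent systems, and specific AI for Science applications (such as bioinformatics), are related to it, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes CM-Bench, a comprehensive cross-modal feature matching benchmark for evaluating infrared-visible image matching algorithms, and introduces a new infrared-satellite cross-modal dataset along with an adaptive preprocessing front-end.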

Abstract

Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, crossmodal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semidense, and dense methods. These methods are evaluated by different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resource will be available at: https://github.com/SLZ98/CM-Bench.

Keywords: cross-modal feature matching, infrared-visible images, benchmark, deep learning, geo-localization, dataset, evaluation, computer vision

207. ❌ STRAP-ViT: Segregated Tokens with Randomized – Transformations for Defense against Adversarial Patches in ViTs

Authors: Nandish Chattopadhyay, Anadi Goyal, Chandan Karfa, Anupam Chattopadhyay Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12688v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

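Scoring rationale: This paper is a computer-vision study of defenses against adversarial patches for Vision Transformers (ViTs). Its core content covers adversarial attacks, defense strategies, ViT architectures, and image classification, which is unrelated to the keyword list (centered on large language models, training techniques, inference optimization, alignment, and agent systems). None of the keywords appears in the title, abstract, or research topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes STRAP-ViT, a defense that detects anomalous tokens and applies randomized transformations to them, countering adversarial patch attacks on Vision Transformers. Across multiple datasets and attacks, its robust accuracy stays within 2-3% of the clean baselines.

Abstract (Chinese translation)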

对抗性补丁是一种物理可实现的局部噪声，能够劫持视觉变换器（ViT）的自注意力机制，将焦点拉向一个高对比度的小区域，并破坏类别标记以迫使模型产生高置信度的错误分类。本文主张，与图像中包含对抗性噪声区域相对应的标记，相较于未与对抗性扰动重叠的标记，具有不同的统计特性。基于此洞见，我们提出一种名为STRAP-ViT的机制，该机制在检测阶段使用詹森-香农散度作为度量来分离表现为异常行为的标记，随后在缓解阶段对这些标记应用随机化复合变换，以使对抗性噪声失效。需变换的最小标记数量是该防御机制的一个超参数，其选择标准是确保至少50%的补丁区域被变换后的标记所覆盖。STRAP-ViT作为一个无需训练、即插即用的模块，可嵌入ViT架构中（仅用于推理阶段），其计算成本极低，且无需任何额外的训练成本或投入。STRAP-ViT已在多种预训练视觉变换器架构（ViT-base-16和DinoV2）和数据集（ImageNet与CalTech-101）上，针对多种对抗性攻击（Adversarial Patch、LAVAN、GDPA和RP2）进行了测试，结果显示其能提供优异的鲁棒性准确率，与干净数据基线相比仅下降2-3%，并优于当前最先进的防御方法。

Abstract

Adversarial patches are physically realizable localized noise, which are able to hijack Vision Transformers (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens which correspond to the areas of the image that contain the adversarial noise, have different statistical properties when compared to the tokens which do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then apply randomized composite transformations on them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter for the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within the ViT architectures, for inference purposes only, with a minimal computational cost and does not require any additional training cost/effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines, and outperform the state-of-the-art.

Keywords: Adversarial Patches, Vision Transformers, Defense Mechanism, Jensen-Shannon Divergence, Robust Accuracy, Plug-and-play, Token Segregation, Randomized Transformations
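
The detection phase described in the abstract, which segregates tokens by Jensen-Shannon divergence, can be illustrated with a minimal sketch. The abstract does not say how token statistics are turned into distributions or what they are compared against, so the softmax over token features, the comparison with the mean token distribution, and the names `flag_anomalous_tokens` and `k` are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(values):
    """Turn a feature vector into a probability distribution (assumed preprocessing)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)

def flag_anomalous_tokens(tokens, k):
    """Return indices of the k tokens farthest (in JSD) from the mean distribution."""
    dists = [softmax(t) for t in tokens]
    dim = len(dists[0])
    mean = [sum(d[j] for d in dists) / len(dists) for j in range(dim)]
    scores = [js_divergence(d, mean) for d in dists]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Seven near-uniform tokens and one high-contrast outlier (toy data).
tokens = [[0.1, 0.0, -0.1, 0.05]] * 7 + [[5.0, -3.0, 0.0, 0.2]]
print(flag_anomalous_tokens(tokens, k=1))  # [7]
```

In the paper's scheme the flagged tokens would then receive randomized composite transformations; that mitigation step is not sketched here because the abstract does not specify the transformations.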

208. ❌ RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

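Scoring rationale: RSONet addresses RGB-T salient object detection, a computer-vision task, and proposes a region-guided selective optimization network to handle inconsistent salient regions between RGB and thermal images. The scoring keywords all concern large models, the principles of deep-learning techniques, or AI applications in science, whereas this paper studies a conventional vision architecture (encoder-decoder structure, fusion modules, and so on). It involves no large models, language models, training techniques, reasoning methods, agent systems, or concrete AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address inconsistent salient regions between RGB and thermal images, the paper proposes the Region-guided Selective Optimization Network (RSONet), which combines context interaction, spatial-aware fusion, and selective optimization modules and achieves performance competitive with 27 state-of-the-art methods on RGB-T datasets.

Abstract (Chinese translation)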

本文聚焦于RGB与热成像图像间显著区域不一致的问题。为解决此问题，我们提出了面向RGB-T显著目标检测的区域引导选择性优化网络，该网络包含区域引导阶段与显著性生成阶段。在区域引导阶段，我们设计了三个具有相同编码器-解码器结构的并行分支，各分支配备上下文交互模块与空间感知融合模块，以生成用于计算相似性分数的引导图。随后，在显著性生成阶段，选择性优化模块基于先前获得的相似性值融合RGB与热成像特征，以减轻两种模态间显著目标分布不一致的影响。此后，为生成高质量的检测结果，采用多重密集连接与视觉状态空间块的密集细节增强模块被应用于低层特征，以优化细节信息。此外，我们在高层特征中嵌入相互交互语义模块，通过双向融合策略挖掘位置线索。我们在RGB-T数据集上进行了大量实验，结果表明所提出的RSONet模型在与27种先进显著目标检测方法的对比中取得了具有竞争力的性能。

Abstract

This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network for RGB-T Salient Object Detection, which consists of the region guidance stage and saliency generation stage. In the region guidance stage, three parallel branches with same encoder-decoder structure equipped with the context interaction (CI) module and spatial-aware fusion (SF) module are designed to generate the guidance maps which are leveraged to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection result, the dense detail enhancement (DDE) module which adopts the multiple dense connections and visual state space blocks is applied to low-level features for optimizing the detail information. In addition, the mutual interaction semantic (MIS) module is placed in the high-level features to dig the location cues by the mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.

Keywords: RGB-T salient object detection, region-guided selective optimization, context interaction module, spatial-aware fusion, selective optimization module, dense detail enhancement, mutual interaction semantic, multi-modal fusion

209. ❌ G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12680v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

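Scoring rationale: This paper addresses salient object detection (SOD) in computer vision, specifically for optical remote sensing images. It proposes a Swin Transformer-based hierarchical feature fusion network (G2HFNet) with several modules for handling scale variation and complex backgrounds. The core is a specific vision task and its network architecture design; it does not involve large language models (LLMs), innovations in deep-learning technique (such as MoE, scaling laws, or training and alignment methods), AI agents, reasoning techniques, or AI for Science (such as bioinformatics). All scoring keywords relate directly to those topics and the paper is unrelated to them, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To handle the large scale variation and complex backgrounds that salient object detection faces in optical remote sensing images, the paper proposes a geometry- and granularity-aware hierarchical feature fusion network (G2HFNet) whose specialized modules strengthen feature representation and improve detection performance in experiments.

Abstract (Chinese translation)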

从航空视角获取的遥感图像常呈现显著的尺度变化与复杂背景，这为显著目标检测任务带来挑战。现有方法通常采用单一尺度的均匀注意力机制提取多层次特征，导致特征表征欠佳且检测结果不完整。为解决这些问题，我们提出一种几何与粒度感知的层次特征融合网络，该网络充分利用光学遥感图像中的几何与粒度线索。具体而言，G2HFNet采用Swin Transformer作为主干网络提取多层次特征，并集成三个核心模块：多尺度细节增强模块用于处理目标尺度变化并丰富细节信息，双分支几何-粒度互补模块联合捕获中层特征的细粒度细节与位置信息，以及深度语义感知模块通过自注意力机制优化高层位置线索。此外，网络引入局部-全局引导融合模块替代传统卷积，以实现有效的多层次特征整合。大量实验表明，G2HFNet能生成高质量的显著图，并在具有挑战性的遥感场景中显著提升检测性能。

Abstract

Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.

Keywords: Salient Object Detection, Optical Remote Sensing Images, Hierarchical Feature Fusion, Swin Transformer, Multi-scale Detail Enhancement, GeoGran-Aware, Self-attention, Feature Integration

210. ❌ Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Authors: Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12669v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

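Scoring rationale: The paper studies fusion methods for Vision-Language Models (VLMs), an application of large models to vision-language tasks. It is moderately related to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because VLMs can be regarded as multimodal foundation models. The paper explicitly mentions 'mitigate hallucinations', so it is highly related to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8 points). The remaining keywords mainly target the technical principles, training methods, inference optimization, and agent systems of pure language models, none of which the paper covers, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes V3Fusion, a vision-verification-enhanced fusion method that selects and fuses multiple vision-language models using focal error diversity and a CKA-based focal diversity metric, improving visual reasoning performance and mitigating hallucinations.

Abstract (Chinese translation)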

随着视觉-语言模型（Vision-Language Models, VLMs）数量与多样性的日益增长，许多研究致力于探索跨多个VLM的基于语言的集成、协作与路由技术，以提升多模型推理能力。与此不同，我们同时利用视觉与语言模态来解决多样化的模型选择问题。我们引入了焦点误差多样性以捕捉不同VLM之间的互补推理能力，并提出一种基于CKA的焦点多样性度量（CKA-focal）来衡量它们在视觉嵌入中的不一致性。在由候选VLM池构建的集成曲面上，我们应用遗传算法有效剔除那些对融合性能无增益的组成VLM。我们不仅为每项任务确定了最佳组合，还融合了模型池中每个VLM的输出，并证明异构模型能够动态捕捉认知不确定性并减少幻觉。我们的V3Fusion方法能够生成具有高焦点多样性的双重融合预测，即使在缺乏多数共识或大多数VLM做出错误预测的情况下，仍能实现高性能的视觉-语言推理。通过在四个主流VLM基准（A-OKVQA、MMMU、MMMU-Pro和OCR-VQA）上的大量实验验证了V3Fusion的有效性。结果显示，V3Fusion在MMMU上以8.09%的准确率提升超越性能最佳的VLM，在MMMU-Pro上提升达4.87%。在生成任务中，V3Fusion在A-OKVQA和OCR-VQA上均优于当前表现最优的两个VLM——Intern-VL2-8b与Qwen2.5-VL-7b。我们的代码与数据集已公开于https://github.com/sftekin/v3fusion。

Abstract

With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address the diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the constructed ensemble surface from a pool of candidate VLMs, we applied a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM on MMMU by 8.09% and MMMU-Pro by 4.87% gain in accuracy. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.

Keywords: Vision-Language Models, Model Fusion, Ensemble Learning, Visual Reasoning, Hallucination Mitigation, Genetic Algorithm, Multi-modal AI, VLM Benchmarks
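
The abstract measures disagreement between the visual embeddings of different VLMs with a CKA-based metric. As background, the sketch below computes standard linear Centered Kernel Alignment (CKA) between two embedding matrices over the same inputs. How the paper turns CKA into its CKA-focal metric is not stated in the abstract, so this shows only the underlying similarity measure; the function names and toy matrices are illustrative assumptions.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in matrix]

def gram(matrix):
    """Gram matrix of pairwise dot products between rows (examples)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]

def frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def linear_cka(x, y):
    """Linear CKA between embeddings x (n x d1) and y (n x d2) of the same n inputs."""
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frobenius_inner(k, l) / math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))

# Toy embeddings of four images from three hypothetical encoders.
model_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
model_b = [[3.0 * v for v in row] for row in model_a]      # rescaled copy of model_a
model_c = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]

print(linear_cka(model_a, model_b))  # close to 1.0: identical up to scale
print(linear_cka(model_a, model_c))  # below 1.0: the representations differ
```

Linear CKA is invariant to isotropic rescaling of either embedding, which is why `model_b` scores as equivalent to `model_a`. A disagreement score could be taken as `1 - linear_cka(...)`, though that particular choice is an assumption here.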

211. ❌ Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization

Authors: Kazuto Nakashima, Hojung Jung, Yuki Oto, Yumi Iwashita, Ryo Kurazume, Oscar Martinez Mozos Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12663v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

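Scoring rationale: The paper uses convolutional neural networks (CNNs) on 3D LiDAR data (panoramic depth/reflectance images) for outdoor place categorization, which belongs to computer vision and robot perception. All scoring keywords relate directly to large language models (LLMs), innovations in deep-learning technique, or AI applications in science, whereas this paper applies conventional CNN architectures to data from a specific sensor. It involves no large-model techniques, training methods, inference optimization, alignment techniques, agent systems, or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method for outdoor place categorization that feeds panoramic LiDAR depth and reflectance images to convolutional neural networks, and validates it on the authors' large-scale MPO dataset, where it outperforms traditional approaches.

Abstract (Chinese translation)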

语义场所分类是自主机器人与车辆的核心任务之一，使其能够在陌生环境中具备自主决策与导航能力。相较于室内场所，户外场所因感知条件的变化（如24小时内动态光照变化、车辆与行人的遮挡等）而成为更具挑战性的目标。本文提出一种利用卷积神经网络（CNNs）进行户外场所分类的新方法，该方法以三维激光雷达（3D LiDAR）获取的全向深度/反射率图像作为输入。首先，我们构建了一个名为“多模态全景三维户外数据集”（Multi-modal Panoramic 3D Outdoor, MPO）的大规模户外场所数据集，包含由两种不同激光雷达采集的两种点云数据，并标注为六类户外场所：海岸、森林、室内/室外停车场、居住区及城区。其次，我们设计了基于激光雷达的户外场所分类卷积神经网络，并利用MPO数据集对该方法进行评估。在MPO数据集上的实验结果表明，我们的方法优于传统方法，并证明了同时使用深度与反射率模态的有效性。为分析训练后的深度网络，我们对所学特征进行了可视化。

Abstract

Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to have capabilities of self-decision and navigation in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness in which we use both depth and reflectance modalities. To analyze our trained deep networks we visualize the learned features.

Keywords: outdoor place categorization, convolutional neural networks, panoramic LiDAR, depth/reflectance images, MPO dataset, autonomous robots, semantic place categorization, 3D LiDAR
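
The method takes omnidirectional depth images derived from 3D LiDAR point clouds as CNN input. A common way to build such an image is a spherical projection, sketched below. The image size, the vertical field of view, and the rule of keeping the nearest return per pixel are assumptions for illustration; the abstract does not describe the authors' projection or sensor parameters.

```python
import math

def project_to_panorama(points, height=4, width=8,
                        fov_up=math.radians(15), fov_down=math.radians(-15)):
    """Project (x, y, z) points into a height x width panoramic range image.

    Columns index azimuth over 360 degrees, rows index elevation between
    fov_down and fov_up. Empty pixels stay 0.0.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        depth = math.sqrt(x * x + y * y + z * z)
        if depth == 0.0:
            continue  # skip degenerate returns at the sensor origin
        azimuth = math.atan2(y, x)            # in [-pi, pi]
        elevation = math.asin(z / depth)
        if not (fov_down <= elevation <= fov_up):
            continue  # outside the assumed vertical field of view
        col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
        row = min(int((fov_up - elevation) / (fov_up - fov_down) * height), height - 1)
        # Keep the nearest return when several points fall into one pixel.
        if image[row][col] == 0.0 or depth < image[row][col]:
            image[row][col] = depth
    return image

cloud = [(1.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
panorama = project_to_panorama(cloud)
print(panorama[2][4], panorama[2][0])  # 1.0 2.0
```

A reflectance image could be built the same way by storing the intensity of each return instead of its range, giving the second input modality the abstract mentions.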

212. ❌ VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

Authors: Yuhang Ming, Tingkang Xi, Xingrui Yang, Lixin Yang, Yong Peng, Cewu Lu, Wanzeng Kong Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12657v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

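Scoring rationale: The paper studies neural volumetric reconstruction in computer vision, using vision foundation models (VFMs) as priors. Most keywords do not apply because they target language models rather than vision models. Relevant keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' scores 8 because the paper uses pretrained VFM features and handles cross-domain scenes; 2) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' scores 8 because the paper fine-tunes through lightweight task-specific adapters; 3) 'AI for Science OR Bioinformatics OR Cheminformatics' scores 5 because the paper applies AI within a scientific field (computer vision / 3D reconstruction), though not bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper tackles cross-domain, scene-level neural volumetric reconstruction from monocular video by adding a scale alignment stage and lightweight adapters that integrate pretrained vision foundation model priors, achieving state-of-the-art performance on multiple datasets.

Abstract (Chinese translation)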

基于单目视频的场景级神经体素重建仍面临挑战，尤其在存在严重域偏移的情况下。尽管视觉基础模型（VFMs）的最新进展提供了从大规模数据中学习到的可迁移广义先验，但其尺度模糊的预测与体素融合所需的尺度一致性不兼容。为弥合这一差距，我们提出了VFM-Recon，这是首次尝试在场景级神经重建中桥接可迁移的VFM先验与尺度一致性要求的研究。具体而言，我们首先引入一个轻量级的尺度对齐阶段，以恢复多视角间的尺度一致性。随后，我们通过轻量级的任务特定适配器将预训练的VFM特征整合到神经体素重建流程中，这些适配器在训练时专注于重建任务，同时保留了预训练表征的跨域鲁棒性。我们在ScanNet训练集上训练模型，并在分布内的ScanNet测试集以及分布外的TUM RGB-D和Tanks and Temples数据集上进行评估。结果表明，我们的模型在所有数据集域上均达到了最先进的性能。特别是在具有挑战性的室外数据集Tanks and Temples上，我们的模型在重建网格评估中取得了70.1的F1分数，显著优于最接近的竞争对手VGGT（其得分仅为51.8）。

Abstract

Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with scale-consistent requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on ScanNet train split and evaluate on both in-distribution ScanNet test split and out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.

Keywords: neural volumetric reconstruction, vision foundation models, cross-domain, scale alignment, lightweight adapters, scene-level reconstruction, monocular videos, VFM-Recon
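
The abstract notes that scale-ambiguous depth predictions must be made consistent across views before volumetric fusion. One simple way to illustrate the idea is a closed-form least-squares scale fit between a predicted depth map and reference depths for the same pixels. This is a generic technique assumed here for illustration; the abstract does not describe how the paper's scale alignment stage actually works, and the function names are hypothetical.

```python
def least_squares_scale(predicted, reference):
    """Scale s minimizing sum((s * p - r) ** 2) over paired depth samples.

    Setting the derivative to zero gives s = sum(p * r) / sum(p * p).
    """
    numerator = sum(p * r for p, r in zip(predicted, reference))
    denominator = sum(p * p for p in predicted)
    if denominator == 0:
        raise ValueError("predicted depths are all zero")
    return numerator / denominator

def align_depth(predicted, reference):
    """Rescale a scale-ambiguous depth prediction into the reference scale."""
    scale = least_squares_scale(predicted, reference)
    return [scale * p for p in predicted]

frame_prediction = [1.0, 2.0, 3.0]   # depths in an arbitrary, per-view scale
frame_reference = [2.0, 4.0, 6.0]    # same pixels in the shared scene scale
print(least_squares_scale(frame_prediction, frame_reference))  # 2.0
print(align_depth(frame_prediction, frame_reference))          # [2.0, 4.0, 6.0]
```

Applying one such factor per view brings all views into a common scale, which is the property volumetric fusion needs.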

213. ❌ ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training

Authors: Jie Ji, Gen Li, Kaiyuan Deng, Fatemeh Afghah, Xiaolong Ma Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

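Scoring rationale: The paper mainly studies training methods for sparse neural networks and is highly related to the keyword "Mixture of Experts OR MoE OR Sparse Models" (8 points), because its core is resolving gradient noise and convergence problems in sparse training. The other keywords concern specific techniques such as large language models, alignment, reasoning, and agents, and have no direct link to the paper's general deep-learning optimization framework, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ZO-SAM, a new optimization framework that integrates zero-order optimization into SAM to address the noisy gradients and difficult convergence of sparse neural network training, substantially reducing computational cost and improving model robustness.

Abstract (Chinese translation)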

深度学习模型尽管取得了令人瞩目的成就，但其高昂的计算成本和内存需求限制了其在资源受限环境中的可用性。稀疏神经网络通过大幅减少参数量和计算开销，显著缓解了这些限制。然而，现有的稀疏训练方法常面临混沌且嘈杂的梯度信号，严重阻碍了收敛与泛化性能，尤其是在高稀疏度条件下。为应对这一关键挑战，我们提出了零阶锐度感知最小化（Zero-Order Sharpness-Aware Minimization, ZO-SAM），这是一种新颖的优化框架，策略性地将零阶优化整合到SAM方法中。与传统SAM不同，ZO-SAM在扰动过程中仅需一次反向传播步骤，并选择性利用零阶梯度估计。这一创新方法相比传统SAM将反向传播计算成本降低一半，显著减少了梯度方差并有效消除了相关计算开销。通过利用SAM识别平坦最小值的能力，ZO-SAM稳定了训练过程并加速了收敛。这些效率提升在稀疏训练场景中尤为重要，因为计算成本是限制SAM实用性的主要瓶颈。此外，使用ZO-SAM训练的模型在分布偏移下表现出更强的鲁棒性，进一步拓宽了其在实际部署中的实用性。

Abstract

Deep learning models, despite their impressive achievements, suffer from high computational costs and memory requirements, limiting their usability in resource-constrained environments. Sparse neural networks significantly alleviate these constraints by dramatically reducing parameter count and computational overhead. However, existing sparse training methods often experience chaotic and noisy gradient signals, severely hindering convergence and generalization performance, particularly at high sparsity levels. To tackle this critical challenge, we propose Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework that strategically integrates zero-order optimization within the SAM approach. Unlike traditional SAM, ZO-SAM requires only a single backpropagation step during perturbation, selectively utilizing zero-order gradient estimations. This innovative approach reduces the backpropagation computational cost by half compared to conventional SAM, significantly lowering gradient variance and effectively eliminating associated computational overhead. By harnessing SAM’s capacity for identifying flat minima, ZO-SAM stabilizes the training process and accelerates convergence. These efficiency gains are particularly important in sparse training scenarios, where computational cost is the primary bottleneck that limits the practicality of SAM. Moreover, models trained with ZO-SAM exhibit improved robustness under distribution shift, further broadening its practicality in real-world deployments.

Keywords: Sparse Training, Zero-Order Optimization, Sharpness-Aware Minimization, Gradient Variance Reduction, Computational Efficiency, Flat Minima, Convergence Acceleration, Distribution Shift Robustness
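
Conventional SAM needs two gradient computations per step: one to find the perturbation direction and one at the perturbed weights. The abstract says ZO-SAM replaces part of this with zero-order gradient estimates so that only one backpropagation remains. The sketch below shows one plausible reading on a toy quadratic: the perturbation direction comes from a random-direction finite-difference estimate (forward evaluations only), and a single analytic gradient drives the update. The estimator, the hyperparameters, and the toy loss are assumptions, not the authors' algorithm.

```python
import math
import random

def loss(w):
    """Toy quadratic loss with a different curvature per coordinate."""
    return sum((i + 1) * wi * wi for i, wi in enumerate(w))

def grad(w):
    """Analytic gradient; stands in for the single backpropagation pass."""
    return [2 * (i + 1) * wi for i, wi in enumerate(w)]

def zo_gradient(f, w, rng, num_dirs=8, mu=1e-3):
    """Zero-order gradient estimate from central differences along random directions."""
    estimate = [0.0] * len(w)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in w]
        plus = f([wi + mu * ui for wi, ui in zip(w, u)])
        minus = f([wi - mu * ui for wi, ui in zip(w, u)])
        slope = (plus - minus) / (2 * mu)
        estimate = [e + slope * ui / num_dirs for e, ui in zip(estimate, u)]
    return estimate

def zo_sam_step(w, rng, lr=0.05, rho=0.05):
    """One step: zero-order ascent direction, then one true gradient at w + eps."""
    g_zo = zo_gradient(loss, w, rng)
    norm = math.sqrt(sum(g * g for g in g_zo)) or 1.0
    w_perturbed = [wi + rho * g / norm for wi, g in zip(w, g_zo)]
    g = grad(w_perturbed)  # the only gradient ("backprop") call in the step
    return [wi - lr * gi for wi, gi in zip(w, g)]

rng = random.Random(0)
w = [1.0, -1.0, 0.5]
initial_loss = loss(w)
for _ in range(100):
    w = zo_sam_step(w, rng)
print(initial_loss, loss(w))
```

Because the perturbation has fixed radius `rho`, the iterates settle in a small neighborhood of the minimum rather than exactly on it, which matches the behavior of SAM-style updates on this kind of toy problem.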

214. ❌ Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors

Authors: Wei W. Xing, Kaiqi Huang, Jiazhan Liu, Hong Qiu, Shan Shen Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13092v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

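Scoring rationale: The paper proposes a zero-hyperparameter method for circuit multi-corner analysis. Its core is in-context learning with a foundation model pre-trained on millions of regression tasks, which adapts to new circuits without fine-tuning. This relates directly to 'In-context Learning' (the core method, 10 points) and 'Pre-training' (the basis of the model, 8 points). As an electronic design automation (EDA) application, it falls under 'AI for Science' (8 points). Although LLMs are not explicitly mentioned, the paper uses a 'foundation model', which is conceptually related to 'Large Language Models' (5 points). The other keywords (such as MoE, SFT, and RAG) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the 'Tuning Barrier' in circuit multi-corner analysis, where AI models require extensive hyperparameter tuning. By relying on the in-context learning ability of a pre-trained foundation model, it achieves zero hyperparameter tuning, maintains high accuracy (mean MREs as low as 0.11%), and reduces total validation cost by more than 10×.

Abstract (Chinese translation)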

良率多角点分析需在超过25个工艺-电压-温度角点下验证电路性能，导致组合仿真成本高达$O(K \times N)$，其中$K$表示角点数量，$N$为每角点超过$10^4$的样本量。现有方法面临根本性权衡：简单模型易于自动化但无法处理非线性电路，而先进人工智能模型虽能捕捉复杂行为，却需每次设计迭代耗费数小时进行超参数调优，形成“调优壁垒”。我们通过用预训练基础模型从数百万回归任务中学得的先验知识，替代人工设计的先验（即模型规范），突破了这一壁垒。该模型具备上下文学习能力，无需调优或重新训练即可即时适配各电路。其注意力机制通过识别不同工作条件间共享的电路物理特性，自动实现跨角点知识迁移。结合自动化特征选择器（将特征维度从1152D降至48D），本方法在零调优条件下达到最先进精度（平均相对误差均值低至0.11%），并将总体验证成本降低超过10倍。

摘要 (Abstract)

Yield Multi-Corner Analysis validates circuits across 25+ Process-Voltage-Temperature corners, resulting in a combinatorial simulation cost of $O(K \times N)$ where $K$ denotes corners and $N$ exceeds $10^4$ samples per corner. Existing methods face a fundamental trade-off: simple models achieve automation but fail on nonlinear circuits, while advanced AI models capture complex behaviors but require hours of hyperparameter tuning per design iteration, forming the Tuning Barrier. We break this barrier by replacing engineered priors (i.e., model specifications) with learned priors from a foundation model pre-trained on millions of regression tasks. This model performs in-context learning, instantly adapting to each circuit without tuning or retraining. Its attention mechanism automatically transfers knowledge across corners by identifying shared circuit physics between operating conditions. Combined with an automated feature selector (1152D to 48D), our method matches state-of-the-art accuracy (mean MREs as low as 0.11%) with zero tuning, reducing total validation cost by over $10\times$.

Keywords: Zero-Hyperparameters, Multi-Corner Analysis, Learned Priors, Foundation Model, In-context Learning, Circuit Validation, Automated Feature Selection, Attention Mechanism

215. ❌ Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects

Authors: Michael Scholkemper, Sach Mukherjee Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13051v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

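Scoring rationale: The paper focuses on AI applications in biomedicine and proposes C3TL, a lightweight framework for predicting the effects of chemical and genetic perturbations on cell states. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points) because its core is an AI application in bioinformatics and computational biology. It has some connection to 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' (5 points each), because it compares against large-scale foundation models and emphasizes its lightweight, small-model design. It also has some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the framework involves transfer learning (Context Transfer Learning) and domain adaptation (generalizing from known perturbations to new contexts). The other keywords (MoE, SFT, RAG, reasoning methods, alignment, compression, and so on) have no direct relation to the paper's biomedical focus or specific method and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes C3TL, a lightweight framework for predicting the effects of chemical and genetic perturbations on cell states. Using only widely available bulk molecular data and small models, it achieves performance competitive with state-of-the-art foundation models, making causal learning methods in biomedicine more accessible.

Abstract translation (Chinese)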

预测化学与遗传扰动对定量细胞状态的影响，是计算生物学、分子医学和药物发现领域的核心挑战。近期研究通过利用大规模单细胞数据和海量基础模型来应对这一任务。然而，此类计算资源和广泛数据集在学术或临床环境中并非总能获取，从而限制了其应用。本文提出一种轻量级扰动效应预测框架，该框架利用生物干预的结构化特性及特定的归纳偏置/不变性。我们的方法借助关于扰动效应的现有信息，以实现对新情境的泛化，并且仅需广泛可得的批量分子数据。通过将针对特定情境的扰动效应预测与真实的大规模干预实验进行比较，广泛测试表明该方法在新情境中能实现精准预测。所提出的方法与最先进的基础模型性能相当，但所需数据更简单、模型规模小得多且耗时更少。通过聚焦于稳健的批量信号和高效架构，我们证明无需专用硬件或超大规模模型即可实现扰动效应的准确预测，从而为在生物医学领域广泛利用因果学习方法开辟了新途径。

摘要 (Abstract)

Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medicine and drug discovery. Recent work has leveraged large-scale single-cell data and massive foundation models to address this task. However, such computational resources and extensive datasets are not always accessible in academic or clinical settings, hence limiting utility. Here we propose a lightweight framework for perturbation effect prediction that exploits the structured nature of biological interventions and specific inductive biases/invariances. Our approach leverages available information concerning perturbation effects to allow generalization to novel contexts and requires only widely-available bulk molecular data. Extensive testing, comparing predictions of context-specific perturbation effects against real, large-scale interventional experiments, demonstrates accurate prediction in new contexts. The proposed approach is competitive with SOTA foundation models but requires simpler data, much smaller model sizes and less time. Focusing on robust bulk signals and efficient architectures, we show that accurate prediction of perturbation effects is possible without proprietary hardware or very large models, hence opening up ways to leverage causal learning approaches in biomedicine generally.

Keywords: perturbation effect prediction, computational biology, lightweight framework, causal learning, transfer learning, bulk molecular data, biomedicine, efficient architecture

216. ❌ 3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting

Authors: Jun Liu, Xiaohui Zhong, Kai Zheng, Jiarui Li, Yifei Li, Tao Zhou, Wenxu Qian, Shun Dai, Ruian Tie, Yangyang Zhao, Hao Li Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13049v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

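Scoring rationale: The paper focuses on tropical cyclone intensity forecasting and proposes 3DTCR, a physics-constrained generative AI framework that uses conditional flow matching (CFM) for 3D structure reconstruction. It is unrelated to most keywords, which mainly concern large language models (LLMs) and related techniques (fine-tuning, alignment, inference optimization, agents, and so on); this work is a generative AI application in a specific scientific field (meteorology) and does not involve LLMs. Only two keywords are relevant: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper mentions "latent domain adaptation" and "two-stage transfer learning", which gives it some connection to domain adaptation (5 points). 2) 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to a scientific field (meteorology) and is highly relevant to "AI for Science" (10 points). The other keywords are not addressed and score 0.

!!! tip deepseek-chat TL;DR

Addressing the challenge of reconstructing fine-scale 3D structure for tropical cyclone intensity forecasting, the paper proposes 3DTCR, a physics-constrained generative AI framework built on conditional flow matching and domain adaptation. It reduces the RMSE of maximum WS10M by 36.5% relative to its FuXi inputs, at lower computational cost than high-resolution numerical simulation.

Abstract translation (Chinese)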

热带气旋（TC）强度预报至今仍具挑战性，因为当前的数值模型和基于人工智能的气象模型均未能令人满意地再现极端TC的结构与强度。尽管强度时间序列预测已取得显著进展，但其输出的是强度序列，而非控制TC演变的三维内核精细结构及物理机制。高分辨率数值模拟虽能捕捉这些特征，但计算成本高昂，在大规模业务化应用中效率低下。本文提出3DTCR——一个基于物理的生成式框架，它将物理约束与生成式人工智能的高效性相结合，用于三维TC结构重建。该框架基于六年、3公里分辨率的移动区域WRF数据集进行训练，利用条件流匹配（CFM）实现区域自适应的涡旋跟随重建，并通过潜在域适应和两阶段迁移学习进行优化。该框架缓解了低分辨率目标和平滑过度预报所带来的局限，在保持路径稳定性的同时，改善了TC内核结构与强度的表征。结果表明，在长达5天的几乎所有预报时效上，3DTCR在TC强度预测方面均优于欧洲中期天气预报中心高分辨率预报系统（ECMWF-HRES），并且相较于其FuXi输入，将最大10米风速（WS10M）的均方根误差降低了36.5%。这些发现凸显了3DTCR作为一个基于物理的生成式框架，能够以较低计算成本高效解析精细尺度结构，这可能为改进TC强度预报提供一条有前景的途径。

摘要 (Abstract)

Tropical cyclone (TC) intensity forecasting remains challenging as current numerical and AI-based weather models fail to satisfactorily represent extreme TC structure and intensity. Although intensity time-series forecasting has achieved significant advances, it outputs intensity sequences rather than the three-dimensional inner-core fine-scale structure and physical mechanisms governing TC evolution. High-resolution numerical simulations can capture these features but remain computationally expensive and inefficient for large-scale operational applications. Here we present 3DTCR, a physics-based generative framework combining physical constraints with generative AI efficiency for 3D TC structure reconstruction. Trained on a six-year, 3-km-resolution moving-domain WRF dataset, 3DTCR enables region-adaptive vortex-following reconstruction using conditional Flow Matching (CFM), optimized via latent domain adaptation and two-stage transfer learning. The framework mitigates limitations imposed by low-resolution targets and over-smoothed forecasts, improving the representation of TC inner-core structure and intensity while maintaining track stability. Results demonstrate that 3DTCR outperforms the ECMWF high-resolution forecasting system (ECMWF-HRES) in TC intensity prediction at nearly all lead times up to 5 days and reduces the RMSE of maximum WS10M by 36.5% relative to its FuXi inputs. These findings highlight 3DTCR as a physics-based generative framework that efficiently resolves fine-scale structures at lower computational cost, which may offer a promising avenue for improving TC intensity forecasting.

Keywords: Tropical cyclone intensity forecasting, 3D structure reconstruction, Physics-based generative framework, Conditional Flow Matching, Domain adaptation, Transfer learning, Computational efficiency, Weather modeling

217. ❌ Convergence Rate of a Functional Learning Method for Contextual Stochastic Optimization

Authors: Noel Smith, Andrzej Ruszczynski Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13048v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

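Scoring rationale: The paper studies a functional learning method for stochastic optimization, focusing on approximating conditional expectations and on an algorithm that learns and optimizes jointly. It belongs to classical mathematical optimization and statistical learning. It does not involve large models, deep learning, the technical principles of AI, or applications of AI to science, and is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The paper studies the setting in which the conditional distribution cannot be sampled directly. It approximates the conditional expectation within a parametric function class, analyzes an algorithm that learns and optimizes simultaneously, and proves that the method achieves a convergence rate of $\mathcal{O}(1/\sqrt{N})$.

Abstract translation (Chinese)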

我们考虑一个涉及两个随机变量的随机优化问题：上下文变量 $X$ 与因变量 $Y$。目标是最小化应用于条件期望 $\mathbb{E}[f(X, Y, \beta) \mid X]$ 的非线性损失泛函的期望值，其中 $f$ 是一个非线性函数，$\beta$ 表示决策变量。我们关注于一个实际重要的设定：直接从 $Y \mid X$ 的条件分布中采样是不可行的，仅能获得一个独立同分布的观测对序列 $\{(X^k, Y^k)\}_{k=0,1,2,\ldots}$。在我们的方法中，条件期望通过一个预先指定的参数函数类进行近似。我们分析了一种同时学习与优化的算法，该算法联合估计条件期望并优化外部目标，并证明该方法达到了 $\mathcal{O}\big(1/\sqrt{N}\big)$ 阶的收敛速率，其中 $N$ 表示观测对的数量。

摘要 (Abstract)

We consider a stochastic optimization problem involving two random variables: a context variable $X$ and a dependent variable $Y$. The objective is to minimize the expected value of a nonlinear loss functional applied to the conditional expectation $\mathbb{E}[f(X, Y, \beta) \mid X]$, where $f$ is a nonlinear function and $\beta$ represents the decision variables. We focus on the practically important setting in which direct sampling from the conditional distribution of $Y \mid X$ is infeasible, and only a stream of i.i.d. observation pairs $\{(X^k, Y^k)\}_{k=0,1,2,\ldots}$ is available. In our approach, the conditional expectation is approximated within a prespecified parametric function class. We analyze a simultaneous learning-and-optimization algorithm that jointly estimates the conditional expectation and optimizes the outer objective, and establish that the method achieves a convergence rate of order $\mathcal{O}\big(1/\sqrt{N}\big)$, where $N$ denotes the number of observed pairs.
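
The problem described in the abstract can be written out as follows. This is a reading of the abstract only: the symbols $g$ (the nonlinear loss functional), $\Phi$ (a member of the parametric function class) and $\theta$ (its parameters) are names introduced here for illustration, and the paper's exact formulation may differ, for example in what $g$ and $\Phi$ are allowed to depend on.

$$
\min_{\beta}\; \mathbb{E}_{X}\Big[\, g\big(\mathbb{E}[f(X, Y, \beta) \mid X]\big) \Big],
\qquad
\mathbb{E}[f(X, Y, \beta) \mid X = x] \;\approx\; \Phi(x, \beta; \theta).
$$

Because the inner conditional expectation sits inside the nonlinear $g$, a single observed pair $(X^k, Y^k)$ does not give an unbiased estimate of the objective, which is why $\theta$ and $\beta$ are updated together from the stream of pairs instead of optimizing $\beta$ by plain sample averaging.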

Keywords: stochastic optimization, conditional expectation, parametric function class, simultaneous learning-and-optimization, convergence rate, nonlinear loss functional, i.i.d. observations, contextual optimization

218. ❌ OpenACMv2: An Accuracy-Constrained Co-Optimization Framework for Approximate DCiM

Authors: Yiqi Zhou, Yue Yuan, Yikai Wang, Bohao Liu, Qinxin Mei, Zhuohua Liu, Shan Shen, Wei Xing, Daying Sun, Li Li, Guozhu Liu Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13042v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

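Scoring rationale: The paper studies a hardware optimization framework for approximate Digital Compute-in-Memory (DCiM), focusing on architecture-circuit co-design, power-performance-area (PPA) optimization, and Monte Carlo transistor-level simulation. All scoring keywords concern large models, the technical principles of deep learning, or AI-for-science applications. This paper addresses none of these topics and belongs to hardware design automation, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes OpenACMv2, an accuracy-constrained co-optimization framework for architecture- and circuit-level design of approximate Digital Compute-in-Memory (DCiM). Its two-level optimization delivers significant power-performance-area (PPA) improvements under controlled accuracy budgets.

Abstract translation (Chinese)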

数字存内计算（Digital Compute-in-Memory，DCiM）通过减少数据移动来加速神经网络。近似数字存内计算可进一步提升功耗-性能-面积（PPA）指标，但需要在架构与晶体管层级耦合的决策中进行精度约束下的协同优化。基于OpenYield平台，我们提出了精度约束协同优化方法，并推出开源框架OpenACMv2。该框架通过两级优化实现协同优化操作化：（1）在精度约束下进行架构搜索，探索压缩器组合与SRAM宏参数，其驱动力来源于基于图神经网络的快速PPA与误差代理模型；（2）采用蒙特卡洛方法对标准单元和SRAM位单元进行考虑工艺偏差及PVT变化的晶体管尺寸优化。通过将协同优化解耦为架构级探索与电路级尺寸调整，OpenACMv2整合了经典的单目标与多目标优化器，能够提供优异的PPA-精度权衡并实现稳健收敛。该工作流兼容FreePDK45与OpenROAD平台，支持可复现的评估与便捷的推广应用。实验表明，在受控精度预算下可实现显著的PPA提升，为近似数字存内计算提供了快速的“假设分析”探索能力。该框架已发布于https://github.com/ShenShan123/OpenACM。

摘要 (Abstract)

Digital Compute-in-Memory (DCiM) accelerates neural networks by reducing data movement. Approximate DCiM can further improve power-performance-area (PPA), but demands accuracy-constrained co-optimization across coupled architecture and transistor-level choices. Building on OpenYield, we introduce Accuracy-Constrained Co-Optimization (ACCO) and present OpenACMv2, an open framework that operationalizes ACCO via two-level optimization: (1) accuracy-constrained architecture search of compressor combinations and SRAM macro parameters, driven by a fast GNN-based surrogate for PPA and error; and (2) variation- and PVT-aware transistor sizing for standard cells and SRAM bitcells using Monte Carlo. By decoupling ACCO into architecture-level exploration and circuit-level sizing, OpenACMv2 integrates classic single- and multi-objective optimizers to deliver strong PPA-accuracy tradeoffs and robust convergence. The workflow is compatible with FreePDK45 and OpenROAD, supporting reproducible evaluation and easy adoption. Experiments demonstrate significant PPA improvements under controlled accuracy budgets, enabling rapid “what-if” exploration for approximate DCiM. The framework is available on https://github.com/ShenShan123/OpenACM.

Keywords: Approximate Computing, Digital Compute-in-Memory, Accuracy-Constrained Co-Optimization, Power-Performance-Area, Architecture-Circuit Co-design, Monte Carlo Simulation, Hardware Optimization, Open-source Framework

219. ❌ Federated Few-Shot Learning on Neuromorphic Hardware: An Empirical Study Across Physical Edge Nodes

Authors: Steven Motta, Gioele Nanni Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13037v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

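Scoring rationale: The paper studies federated learning on neuromorphic hardware, focusing on binary weight updates, fine-tuning of the feature extractor, and a prototype complementarity mechanism. All keywords concern large language models, the technical principles of deep learning, or AI-for-science applications, whereas this paper studies neuromorphic computing and federated learning and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper studies the feasibility of federated learning on neuromorphic hardware. It finds that neuron-level concatenation (FedUnion) outperforms element-wise weight averaging (FedAvg), and identifies feature quality and prototype complementarity as the key factors in federated performance.

Abstract translation (Chinese)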

神经形态硬件上的联邦学习仍属空白，因为片上脉冲时序依赖可塑性（STDP）产生的是二元权重更新，而非标准算法所预设的浮点梯度。我们使用BrainChip Akida AKD1000处理器构建了一个双节点联邦系统，并在七个分析阶段中进行了约1,580次实验测试。在四种权重交换策略中，神经元级拼接策略（FedUnion）始终能保持精度，而逐元素权重平均策略（FedAvg）则完全破坏了精度（p = 0.002）。对上游特征提取器进行领域自适应微调贡献了大部分的精度提升，这证实了特征质量是主导因素。将特征维度从64扩展到256时，最佳策略的联邦准确率达到77.0%（n=30，p < 0.001）。两个独立的不对称现象（更宽的特征对联邦学习的助益大于个体学习，而二值化对联邦学习的损害更大）指向一个共享的原型互补机制：跨节点迁移的效果随神经元原型独特性而增强。

摘要 (Abstract)

Federated learning on neuromorphic hardware remains unexplored because on-chip spike-timing-dependent plasticity (STDP) produces binary weight updates rather than the floating-point gradients assumed by standard algorithms. We build a two-node federated system with BrainChip Akida AKD1000 processors and run approximately 1,580 experimental trials across seven analysis phases. Of four weight-exchange strategies tested, neuron-level concatenation (FedUnion) consistently preserves accuracy while element-wise weight averaging (FedAvg) destroys it (p = 0.002). Domain-adaptive fine-tuning of the upstream feature extractor accounts for most of the accuracy gains, confirming feature quality as the dominant factor. Scaling feature dimensionality from 64 to 256 yields 77.0% best-strategy federated accuracy (n=30, p < 0.001). Two independent asymmetries (wider features help federation more than individual learning, while binarization hurts federation more) point to a shared prototype complementarity mechanism: cross-node transfer scales with the distinctiveness of neuron prototypes.
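
The contrast between the two weight-exchange strategies named in the abstract can be sketched with toy data. This is an illustrative sketch only: the nested-list weight layout, the function names `fed_avg` and `fed_union`, and the example values are assumptions made here, not the Akida on-chip format or the authors' implementation. It shows one plausible reason averaging is a poor fit for binary weights, namely that the result is no longer binary.

```python
# Toy illustration: each row is one neuron's binary weight vector (a prototype).
# Names and data layout are hypothetical, chosen only for this sketch.

def fed_avg(node_a, node_b):
    """Element-wise average of two equally shaped weight matrices."""
    return [[(wa + wb) / 2 for wa, wb in zip(row_a, row_b)]
            for row_a, row_b in zip(node_a, node_b)]

def fed_union(node_a, node_b):
    """Neuron-level concatenation: keep every neuron (row) from both nodes."""
    return [list(row) for row in node_a] + [list(row) for row in node_b]

node_a = [[1, 0, 1, 0], [0, 1, 0, 1]]
node_b = [[0, 1, 1, 0], [1, 0, 0, 1]]

averaged = fed_avg(node_a, node_b)   # same number of neurons, fractional weights
merged = fed_union(node_a, node_b)   # twice as many neurons, weights stay 0 or 1

print(averaged)
print(merged)
```

In this toy case averaging blends two distinct prototypes into values such as 0.5 that a binary-weight layer cannot represent, while concatenation keeps each node's prototypes intact at the cost of a larger layer.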

Keywords: Federated Learning, Neuromorphic Hardware, Few-Shot Learning, Spike-Timing-Dependent Plasticity, Weight Exchange Strategies, Domain-Adaptive Fine-tuning, Feature Dimensionality, Prototype Complementarity

220. ❌ Association-Aware GNN for Precoder Learning in Cell-Free Systems

Authors: Mingyu Deng, Shengqian Han Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13035v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

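Scoring rationale: The paper studies precoder learning with graph neural networks in wireless communication systems, an application of deep learning to communications engineering. All scoring keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), whereas this paper involves no language models, natural language processing, or large-model techniques. It applies conventional deep learning (a graph neural network) to a communication optimization problem, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an association-aware graph neural network (AAGNN) for precoder learning in cell-free systems that accounts for the user equipment (UE)-access point (AP) association status. Simulations show that it outperforms baseline learning methods in both learning performance and generalization.

Abstract translation (Chinese)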

深度学习已被广泛视为优化传统蜂窝系统中多用户多天线预编码器的有效方法。然而，无蜂窝系统与蜂窝系统的一个关键区别在于用户设备（UE）与接入点（AP）关联的灵活性。因此，最优预编码器不仅取决于信道状态信息，还依赖于动态的UE-AP关联状态。本文提出了一种关联感知图神经网络（AAGNN），其将关联状态显式地纳入预编码设计中。我们利用无蜂窝预编码策略的置换等变性特性来降低AAGNN的训练复杂度，并采用注意力机制以增强其泛化性能。仿真结果表明，所提出的AAGNN在学习性能和泛化能力上均优于基线学习方法，同时保持了较低的训练与推理复杂度。

摘要 (Abstract)

Deep learning has been widely recognized as a promising approach for optimizing multi-user multi-antenna precoders in traditional cellular systems. However, a critical distinction between cell-free and cellular systems lies in the flexibility of user equipment (UE)-access point (AP) associations. Consequently, the optimal precoder depends not only on channel state information but also on the dynamic UE-AP association status. In this paper, we propose an association-aware graph neural network (AAGNN) that explicitly incorporates association status into the precoding design. We leverage the permutation equivariance properties of the cell-free precoding policy to reduce the training complexity of AAGNN and employ an attention mechanism to enhance its generalization performance. Simulation results demonstrate that the proposed AAGNN outperforms baseline learning methods in both learning performance and generalization capabilities while maintaining low training and inference complexity.

Keywords: cell-free systems, precoder learning, graph neural network, association-aware, permutation equivariance, attention mechanism, wireless communication, deep learning

221. ❌ FraudFox: Adaptable Fraud Detection in the Real World

Authors: Matthew Butler, Yi Fan, Christos Faloutsos Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13014v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

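Scoring rationale: The paper "FraudFox: Adaptable Fraud Detection in the Real World" focuses on fraud detection systems. It uses Extended Kalman Filters and Pareto optimization to address adversarial attacks in a resource-constrained environment, and is already deployed in production at Amazon. Its content is unrelated to all scoring keywords, which center on large models, the technical principles of deep learning, and their applications: it involves no large models, deep learning, AI for Science, or related techniques, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to optimize a fraud detection system in a resource-constrained adversarial environment. It achieves adaptive fraud prevention through dynamic weight adjustment and an optimal decision surface, and is already deployed in production at Amazon.

Abstract translation (Chinese)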

本文提出的方法（FraudFox）为资源受限环境下的对抗性攻击提供了解决方案。我们重点关注以下问题：在周一凌晨3点尝试购买500美元鞋子的“Smith”行为可疑程度如何？在对抗性环境中，如何整合来自多个风险评估模块（“预言机”）的风险评分？更重要的是，在给定历史数据（订单、价格及后续结果）及业务目标/限制条件下，对于如上述“Smith”交易般的交易，哪些应予以“通过”，哪些应转交人工审核？业务限制可能包括：“最多可进行x次调查”或“因欺诈造成的损失不得超过y美元”。这正是本工作聚焦的两个研究问题。针对第一个问题（“预言机加权”），我们采用扩展卡尔曼滤波器结合动态重要性权重的方法，以自动持续更新每个“预言机”的权重。对于第二个问题，我们展示了如何推导最优决策曲面，并计算帕累托最优解集以支持假设分析。一个关键考量是适应性：欺诈者会根据我们过去的决策改变行为模式，因此我们需要相应调整。最终构建的系统FraudFox具备可扩展性，能适应欺诈行为的变化，且高效实用，目前已在亚马逊投入生产运行。FraudFox增强了反欺诈子系统，并带来了显著的性能提升。

摘要 (Abstract)

The proposed method (FraudFox) provides solutions to adversarial attacks in a resource constrained environment. We focus on questions like the following: How suspicious is 'Smith', trying to buy \$500 shoes, on Monday 3am? How to merge the risk scores, from a handful of risk-assessment modules ('oracles') in an adversarial environment? More importantly, given historical data (orders, prices, and what-happened afterwards), and business goals/restrictions, which transactions, like the 'Smith' transaction above, which ones should we 'pass', versus send to human investigators? The business restrictions could be: 'at most $x$ investigations are feasible', or 'at most \$$y$ lost due to fraud'. These are the two research problems we focus on, in this work. One approach to address the first problem ('oracle-weighting'), is by using Extended Kalman Filters with dynamic importance weights, to automatically and continuously update our weights for each 'oracle'. For the second problem, we show how to derive an optimal decision surface, and how to compute the Pareto optimal set, to allow what-if questions. An important consideration is adaptation: Fraudsters will change their behavior, according to our past decisions; thus, we need to adapt accordingly. The resulting system, FraudFox, is scalable, adaptable to changing fraudster behavior, effective, and already in production at Amazon. FraudFox augments a fraud prevention sub-system and has led to significant performance gains.
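
The "Pareto optimal set" and the what-if questions mentioned in the abstract can be illustrated with a generic two-objective filter. This is a minimal sketch under assumptions: the operating points, the function name `pareto_front`, and the budget value are hypothetical, and the code is a textbook dominance check, not FraudFox's decision-surface derivation.

```python
def pareto_front(points):
    """Return the points not dominated by any other (lower is better on both axes)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical operating points: (investigations required, dollars lost to fraud).
candidates = [(10, 900), (20, 500), (30, 520), (40, 200), (50, 200), (15, 950)]
front = pareto_front(candidates)

# What-if question: with at most 25 investigations, which point loses the least?
budget = 25
best = min((p for p in front if p[0] <= budget), key=lambda p: p[1])

print(front)
print(best)
```

Restricting the choice to the Pareto set means a business constraint such as "at most $x$ investigations" only has to be checked against a short list of non-dominated operating points.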

Keywords: Fraud Detection, Adversarial Attacks, Extended Kalman Filters, Pareto Optimal, Resource Constrained, Adaptive System, Risk Assessment, Production Deployment

222. ❌ PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Authors: Chenlong Yin, Runpeng Geng, Yanting Wang, Jinyuan Jia Venue/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13026v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

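Scoring rationale: PISmith focuses on the security evaluation of large language models (LLMs), specifically defenses against prompt injection attacks. The paper is centrally about LLMs (highly relevant, 10 points) and LLM agents (highly relevant, 10 points), because its focus is evaluating prompt injection defenses in LLM applications, especially autonomous agents. It uses reinforcement learning (RL) for red teaming but does not involve specific alignment techniques such as RLHF, RLAIF, or DPO. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Quantization, are unrelated to the paper's cybersecurity evaluation topic and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PISmith, a reinforcement learning-based red-teaming framework for systematically evaluating the robustness of existing prompt injection defenses, and finds that state-of-the-art defenses remain vulnerable to adaptive attacks.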

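Abstract translation

Prompt injection poses a serious security threat to real-world LLM applications, especially autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks has not been adequately evaluated, which may create a false sense of security. This work proposes PISmith, a reinforcement learning-based red-teaming framework that systematically evaluates existing prompt injection defenses by training an attack LLM to optimize injected prompts. The framework operates in a practical black-box setting in which the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses yields sub-optimal performance because of extreme reward sparsity: most generated injected prompts are blocked by the defense, so the policy's entropy collapses before effective attack strategies are discovered, while the rare successes cannot be learned from effectively. To address this, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks shows that current state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across three categories (static, search-based, and RL-based), and PISmith consistently achieves the highest attack success rate. In addition, in the agentic settings of InjecAgent and AgentDojo, PISmith performs strongly against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at https://github.com/albert-y1n/PISmith.

Abstract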

Prompt injection poses serious security risks to real-world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, potentially creating a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)-based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to sub-optimal performance due to extreme reward sparsity – most generated injected prompts are blocked by the defense, causing the policy’s entropy to collapse before discovering effective attack strategies, while the rare successes cannot be learned effectively. In response, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search-based, and RL-based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at https://github.com/albert-y1n/PISmith.

Keywords: Prompt Injection, LLM Security, Reinforcement Learning, Red Teaming, Autonomous Agents, Black-box Attack, Adaptive Attacks, Defense Evaluation
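
The PISmith abstract attributes the failure of standard GRPO to reward sparsity and names two remedies, adaptive entropy regularization and dynamic advantage weighting, without giving formulas. The sketch below is only one plausible reading, not the paper's method: it uses the usual group-normalized advantage and then applies a hypothetical boost to successful samples that grows as the group success rate falls, together with a hypothetical rule that raises the entropy coefficient when policy entropy drops below a target. All function names, the boost rule, and the constants are assumptions.

```python
import numpy as np


def weighted_group_advantages(rewards, max_boost=4.0, eps=1e-6):
    """Group-relative advantages with rare successes up-weighted (illustrative)."""
    r = np.asarray(rewards, dtype=float)
    # Group-normalized advantage: reward minus group mean, over group std.
    adv = (r - r.mean()) / (r.std() + eps)
    # Assumed rule: the rarer the successes, the larger the boost (capped).
    success_rate = float((r > 0).mean())
    boost = min(max_boost, 1.0 / max(success_rate, eps))
    return np.where(adv > 0, adv * boost, adv)


def update_entropy_coef(coef, entropy, target, lr=0.1, low=0.0, high=0.1):
    """Assumed rule: raise the entropy bonus when entropy falls below target."""
    return float(np.clip(coef + lr * (target - entropy), low, high))


# One group of four sampled injections, only the last one bypassed the defense.
print(weighted_group_advantages([0, 0, 0, 1]))
# A group with no success at all yields zero advantages, hence no gradient signal.
print(weighted_group_advantages([0, 0, 0, 0]))
```

The all-zero group illustrates the sparsity problem the abstract describes: when every injected prompt is blocked, the group-relative advantage carries no learning signal, so exploration has to be kept alive by other means.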

223. ❌ Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Authors: Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12996v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

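Scoring rationale: The paper's core contribution is DAPD, a new parallel decoding method for diffusion large language models (dLLMs) that uses self-attention to build a conditional dependency graph guiding parallel decoding. It is highly relevant to 'Large Language Models' (10 points), because dLLMs are a variant of LLMs and the paper mentions LLMs repeatedly. It has some relevance to 'Speculative Decoding OR Inference Acceleration' (8 points), because DAPD aims to accelerate dLLM inference through parallel decoding, which falls under inference acceleration. Other keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of handling inter-token dependencies in parallel decoding for diffusion LLMs, the paper proposes DAPD, a training-free decoding method that builds a dependency graph from self-attention. Experiments show that it improves the accuracy-steps trade-off and better exploits the any-order generation capability of dLLMs.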

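Abstract translation

Parallel decoding for diffusion large language models (dLLMs) is challenging because each denoising step provides only token-level marginal distributions, whereas unmasking multiple tokens at once requires accounting for dependencies between tokens. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to derive a conditional dependency graph over the masked tokens. At each iteration, edges in the graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding then reduces to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids updating strongly coupled tokens together, without requiring auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and yields more globally distributed parallel updates, making better use of the any-order generation capability of dLLMs.

Abstract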

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.

Keywords: Diffusion LLMs, Parallel Decoding, Dependency-Aware, Self-Attention, Conditional Dependency Graph, Inference Acceleration, dLLMs, Decoding Method
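
The DAPD abstract reduces parallel decoding to picking an independent set on a dependency graph over the masked tokens. A minimal sketch of that step follows. How the paper turns attention into edges and how it orders candidates is not stated in the abstract, so the symmetric threshold `tau`, the greedy confidence ordering, and the function name are all assumptions made for illustration.

```python
import numpy as np


def select_parallel_tokens(attn, masked, confidence, tau=0.1):
    """Greedily pick masked positions with no strong attention link between them.

    attn:       (n, n) attention matrix (assumed averaged over heads/layers)
    masked:     indices of currently masked positions
    confidence: per-position confidence of the model's top prediction
    tau:        assumed threshold above which two tokens count as coupled
    """
    masked = np.asarray(masked)
    confidence = np.asarray(confidence, dtype=float)
    sub = attn[np.ix_(masked, masked)]
    # An edge exists if attention in either direction exceeds the threshold.
    edges = np.maximum(sub, sub.T) > tau
    np.fill_diagonal(edges, False)

    chosen = []
    # Visit masked positions from most to least confident.
    for i in np.argsort(-confidence[masked]):
        if not any(edges[i, j] for j in chosen):
            chosen.append(int(i))
    return sorted(int(masked[i]) for i in chosen)


attn = np.zeros((4, 4))
attn[0, 1] = 0.5   # tokens 0 and 1 are strongly coupled
attn[2, 3] = 0.4   # tokens 2 and 3 are strongly coupled
attn[1, 2] = 0.05  # weak link, below the threshold
conf = [0.9, 0.8, 0.3, 0.7]
print(select_parallel_tokens(attn, [0, 1, 2, 3], conf))  # → [0, 3]
```

Greedy selection does not return a maximum independent set in general; it is used here only because it is simple and always returns a valid independent set.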

224. ❌ Retrieval-Enhanced Real Estate Appraisal

Authors: Simon Popelier, Matthieu X. B. Sarazin, Maximilien Bohm, Mathieu Gierski, Hanna Mergui, Matthieu Ospici, Adrien Bernhardt Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

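Scoring rationale: The paper focuses on the selection of comparable transactions in real estate appraisal and proposes learning a selection policy on top of a hybrid vector-geographical retrieval module. It is unrelated to most LLM-technology keywords, since it does not involve LLMs, MoE, SLMs, pre-training, alignment, inference optimization, agents, or other core LLM techniques. The only relevant keyword is 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation', because the paper uses a retrieval module to select and exploit comparable transaction data, which has some connection to the retrieval concept in retrieval-augmented generation. However, the paper is not a RAG application for large language models, so it receives 5 points (some relevance). All other keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the selection of comparable transactions in real estate appraisal and proposes learning the selection policy rather than imposing it. Combined with a hybrid vector-geographical retrieval module, the approach reaches performance close to state-of-the-art models while using fewer comparables and fewer model parameters.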

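Abstract translation

The Sales Comparison Approach (SCA) is one of the most commonly used methods in real estate appraisal. It serves both as a reference in professional real estate expertise and as one of the major types of Automatic Valuation Models (AVM), and it has recently attracted growing attention in machine learning. The performance of models that can handle data represented as sets and graphs has made it possible to adapt the approach efficiently, with substantial results. At its core, SCA takes past transactions (comparables) as references, selected according to their similarity to the sale of the target property. This study focuses on the selection of comparables in real estate appraisal. We show that the comparables selection used in many state-of-the-art algorithms can be significantly improved by learning a selection policy rather than imposing one. Our method relies on a hybrid vector-geographical retrieval module that adapts to different datasets and is optimized jointly with the estimation module. We further show that carefully selected comparables make it possible to build models that need fewer comparables and fewer parameters while performing close to state-of-the-art models. All evaluations are carried out on five datasets covering areas in the United States, Brazil, and France.

Abstract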

The Sales Comparison Approach (SCA) is one of the most popular when it comes to real estate appraisal. Used as a reference in real estate expertise and as one of the major types of Automatic Valuation Models (AVM), it recently gained popularity within machine learning methods. The performance of models able to use data represented as sets and graphs made it possible to adapt this methodology efficiently, yielding substantial results. SCA relies on taking past transactions (comparables) as references, selected according to their similarity with the target property’s sale. In this study, we focus on the selection of these comparables for real estate appraisal. We demonstrate that the selection of comparables used in many state-of-the-art algorithms can be significantly improved by learning a selection policy instead of imposing it. Our method relies on a hybrid vector-geographical retrieval module capable of adapting to different datasets and optimized jointly with an estimation module. We further show that the use of carefully selected comparables makes it possible to build models that require fewer comparables and fewer parameters with performance close to state-of-the-art models. All our evaluations are made on five datasets which span areas in the United States, Brazil, and France.

Keywords: Real Estate Appraisal, Sales Comparison Approach, Comparables Selection, Retrieval Module, Automatic Valuation Models, Machine Learning, Hybrid Vector-Geographical Retrieval, Property Valuation
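
The abstract describes a hybrid vector-geographical retrieval module but gives no formula for it. The sketch below shows one simple way such a hybrid score could be formed: a fixed convex combination of embedding cosine similarity and an exponentially decaying function of great-circle distance. In the paper the selection policy is learned jointly with the estimation module, so the fixed weight `alpha`, the decay length `scale_km`, and all names here are illustrative assumptions rather than the authors' design.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (inputs in degrees)."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def retrieve_comparables(q_vec, q_latlon, vecs, latlons, k=2,
                         alpha=0.5, scale_km=5.0):
    """Return indices of the k best comparables under a hybrid score."""
    q = np.asarray(q_vec, dtype=float)
    vecs = np.asarray(vecs, dtype=float)
    latlons = np.asarray(latlons, dtype=float)
    # Feature similarity between the target property and each past sale.
    cos = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    # Geographic proximity, mapped to (0, 1] so both terms share a scale.
    dist = haversine_km(q_latlon[0], q_latlon[1], latlons[:, 0], latlons[:, 1])
    geo = np.exp(-dist / scale_km)
    score = alpha * cos + (1 - alpha) * geo
    return [int(i) for i in np.argsort(-score)[:k]]


# Hypothetical past sales: same features nearby, same features far away,
# and somewhat different features nearby.
vecs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
latlons = [[48.85, 2.35], [45.76, 4.84], [48.85, 2.35]]
print(retrieve_comparables([1.0, 0.0], (48.85, 2.35), vecs, latlons))  # → [0, 2]
```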

225. ❌ Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models

Authors: Yijun Quan, Wentai Wu, Giovanni Montana Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

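Scoring rationale: The paper's core topic is exact unlearning for foundation models in a federated learning setting, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It touches on the adaptation and deployment of foundation models and so has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the foundation model serves as a frozen feature extractor, and adapting it can be viewed as a form of domain adaptation. Other keywords, such as MoE, SLMs, SFT, RLHF, RAG, inference acceleration, and AI for Science, are neither mentioned nor implied in the title or abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies exact unlearning for a frozen foundation model with a ridge-regression head in federated learning. It proposes a communication protocol that supports arbitrary add and delete requests through fixed-size messages, guarantees that the server-maintained head is identical, in exact arithmetic, to centralized retraining, and validates this on four benchmarks.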

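Abstract translation

Foundation models are commonly deployed as frozen feature extractors with a small trainable head that adapts to private, user-generated data in federated settings. The "right to be forgotten" requires that the influence of specific samples or users can be removed from a trained model on demand. Existing federated unlearning methods mainly target general deep models and rely on approximate reconstruction or selective retraining, which makes exact unlearning costly or hard to achieve. We study the problem in a practically relevant but under-explored regime: a frozen foundation model with a ridge-regression head. The exact optimum depends on the data only through two additive sufficient statistics, which we turn into a communication protocol that supports an arbitrary stream of "add" and "delete" requests via fixed-size messages. The head maintained by the server is, in exact arithmetic, pointwise identical to the result of centralized retraining after every request. We provide deterministic retrain-equivalence guarantees, order and partition invariance, two server-side variants, and a Bayesian certificate of zero KL divergence. Experiments on four benchmarks confirm these guarantees: both variants match centralized ridge retraining to within $10^{-9}$ relative Frobenius error and complete each request orders of magnitude faster.

Abstract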

Foundation models are commonly deployed as frozen feature extractors with a small trainable head to adapt to private, user-generated data in federated settings. The "right to be forgotten" requires removing the influence of specific samples or users from the trained model on demand. Existing federated unlearning methods target general deep models and rely on approximate reconstruction or selective retraining, making exactness costly or elusive. We study this problem in a practically relevant but under-explored regime: a frozen foundation model with a ridge-regression head. The exact optimum depends on the data only through two additive sufficient statistics, which we turn into a communication protocol supporting an arbitrary stream of add and delete requests via fixed-size messages. The server maintains a head that is, in exact arithmetic, pointwise identical to centralized retraining after every request. We provide deterministic retrain-equivalence guarantees, order and partition invariance, two server-side variants, and a Bayesian certificate of zero KL divergence. Experiments on four benchmarks confirm the guarantees: both variants match centralized ridge retraining to within $10^{-9}$ relative Frobenius error and complete each request at orders-of-

Keywords: Federated Learning, Continual Unlearning, Foundation Models, Ridge Regression, Exact Unlearning, Right to be Forgotten, Frozen Feature Extractor, Communication Protocol
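
The abstract states that the ridge optimum depends on the data only through two additive sufficient statistics. For standard ridge regression with features $\Phi$ and targets $Y$, the closed form $W = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y$ suggests these are the Gram matrix $\Phi^\top \Phi$ and the cross term $\Phi^\top Y$; that identification is an assumption here, as the abstract does not name them. Under it, a minimal sketch of add and delete requests looks as follows. The class and function names are hypothetical, and the paper's two server-side variants are not reproduced.

```python
import numpy as np


def client_message(feats, targets):
    """Fixed-size summary of a batch: (d x d Gram matrix, d x k cross term)."""
    return feats.T @ feats, feats.T @ targets


class RidgeHeadServer:
    """Keeps running sums of the two statistics and solves for the head."""

    def __init__(self, dim, n_out, lam=1.0):
        self.gram = np.zeros((dim, dim))
        self.cross = np.zeros((dim, n_out))
        self.lam = lam

    def apply(self, message, delete=False):
        # Because the statistics are additive, deleting is subtracting.
        sign = -1.0 if delete else 1.0
        gram, cross = message
        self.gram += sign * gram
        self.cross += sign * cross

    def head(self):
        reg = self.lam * np.eye(self.gram.shape[0])
        return np.linalg.solve(self.gram + reg, self.cross)


rng = np.random.default_rng(0)
X1, Y1 = rng.normal(size=(20, 5)), rng.normal(size=(20, 2))
X2, Y2 = rng.normal(size=(30, 5)), rng.normal(size=(30, 2))

server = RidgeHeadServer(dim=5, n_out=2, lam=0.1)
server.apply(client_message(X1, Y1))               # client 1 adds data
server.apply(client_message(X2, Y2))               # client 2 adds data
server.apply(client_message(X1, Y1), delete=True)  # client 1 asks to be forgotten

# Retraining from scratch on client 2's data alone gives the same head,
# up to floating-point rounding.
retrained = np.linalg.solve(X2.T @ X2 + 0.1 * np.eye(5), X2.T @ Y2)
print(np.allclose(server.head(), retrained))
```

The message size depends only on the feature dimension and the number of outputs, not on how many samples a client holds, which is what makes fixed-size messages possible in this setting.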

226. ❌ A theory of learning data statistics in diffusion models, from easy to hard

Authors: Lorenzo Bardone, Claudia Merger, Sebastian Goldt Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12901v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

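Scoring rationale: The paper studies the learning dynamics of diffusion models, in particular the order in which data statistics are learned (from simple to complex), and belongs to the theoretical analysis of generative models. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas the paper focuses on diffusion models, a class of generative models with no direct connection to LLMs. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how diffusion models learn data statistics from simple to complex and introduces the diffusion information exponent to explain the differences in sample complexity for learning statistics of different orders.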

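Abstract translation

Although diffusion models have become a powerful class of generative models, their learning dynamics remain poorly understood. We first show empirically that standard diffusion models trained on natural images exhibit a distributional simplicity bias: the model learns simple, pair-wise input statistics first and only then specializes to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, which lets us precisely control the pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations, which we call the diffusion information exponent by analogy with related invariants in other learning paradigms. Using this invariant, we prove that the denoiser learns the simple pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant drops to linear when the pair-wise and higher-order statistics share a correlated latent structure. This work describes a key mechanism by which diffusion models learn distributions of increasing complexity.

Abstract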

While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

Keywords: diffusion models, learning dynamics, data statistics, sample complexity, mixed cumulant model, diffusion information exponent, generative models

227. ❌ Enhanced Drug-drug Interaction Prediction Using Adaptive Knowledge Integration

Authors: Pengfei Liu, Jun Tao, Zhixiang Ren Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12885v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

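Scoring rationale: The paper's core is the use of a large language model (LLM) for drug-drug interaction prediction, an AI for Science (bioinformatics) application. It explicitly mentions using an LLM and reinforcement learning techniques (related to RLHF), so these two keywords are highly relevant (10 points). It mentions few-shot learning, which has some relevance to In-context Learning (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study proposes an adaptive knowledge augmentation framework that uses large language models and reinforcement learning to improve the accuracy of drug-drug interaction prediction, achieving better performance than the baseline through few-shot learning.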

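Abstract translation

Predicting drug-drug interaction events (DDIE) is crucial for preventing adverse reactions and ensuring optimal therapeutic outcomes. However, existing methods often struggle with imbalanced datasets, complex interaction mechanisms, and poor generalization to unknown drug combinations. To address these problems, we propose a knowledge augmentation framework that adaptively infuses prior drug knowledge into a large language model (LLM). The framework uses reinforcement learning to facilitate adaptive knowledge extraction and synthesis, efficiently optimizing the strategy space to improve the accuracy of LLMs on DDIE prediction. With few-shot learning, we achieve a notable improvement over the baseline. The approach establishes an effective framework for scientific knowledge learning in DDIE prediction.

Abstract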

Drug-drug interaction event (DDIE) prediction is crucial for preventing adverse reactions and ensuring optimal therapeutic outcomes. However, existing methods often face challenges with imbalanced datasets, complex interaction mechanisms, and poor generalization to unknown drug combinations. To address these challenges, we propose a knowledge augmentation framework that adaptively infuses prior drug knowledge into a large language model (LLM). This framework utilizes reinforcement learning techniques to facilitate adaptive knowledge extraction and synthesis, thereby efficiently optimizing the strategy space to enhance the accuracy of LLMs for DDIE predictions. As a result of few-shot learning, we achieved a notable improvement compared to the baseline. This approach establishes an effective framework for scientific knowledge learning for DDIE predictions.

Keywords: Drug-drug interaction prediction, Large language model, Knowledge augmentation, Reinforcement learning, Few-shot learning, Adaptive knowledge integration, DDIE prediction, Scientific knowledge learning

228. ❌ Explainable AI Using Inherently Interpretable Components for Wearable-based Health Monitoring

Authors: Maurice Kuschel, Solveig Vieluf, Claus Reinsberger, Tobias Loddenkemper, Tanuj Hasija Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12880v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

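Scoring rationale: The paper focuses on explainable AI (XAI) methods for wearable-based health monitoring and is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points), because its core contribution is a new XAI method. It also has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), because it is applied to medical health monitoring (epileptic seizure detection and similar tasks), an application of AI in science and bioinformatics. The paper has no connection to the remaining keywords, which concern large models, the principles of deep learning techniques, training methods, inference optimization, agent systems, and the like, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new explainable AI method that combines explanation spaces with concept-based explanations, using inherently interpretable components to explain AI predictions on wearable time-series data. It achieves interpretability while preserving model performance and is applied to health monitoring and epileptic seizure detection.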

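Abstract translation

The use of wearables in medicine and wellness, enabled by AI-based models, offers great potential for real-time monitoring and interpretable event detection. Explainable AI (XAI) is essential for assessing what a model has learned and for building trust in model outputs among patients, healthcare professionals, model developers, and domain experts. Explaining AI decisions made on time-series data recorded by wearables is especially difficult because of the data's complexity and temporal dependencies. At present, explanation methods that rely on interpretable features often lead to a loss in model performance. We propose a novel XAI method that combines explanation spaces with concept-based explanations to explain AI predictions on time-series data. By using Inherently Interpretable Components (IICs), which encapsulate domain-specific, interpretable concepts within a custom explanation space, we preserve the performance of models trained on time series while obtaining interpretable, concept-based explanations built on extracted features. In addition, we define a domain-specific set of IICs for wearable-based health monitoring and demonstrate their usability in real applications, including state assessment and epileptic seizure detection.

Abstract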

The use of wearables in medicine and wellness, enabled by AI-based models, offers tremendous potential for real-time monitoring and interpretable event detection. Explainable AI (XAI) is required to assess what models have learned and build trust in model outputs, for patients, healthcare professionals, model developers, and domain experts alike. Explaining AI decisions made on time-series data recorded by wearables is especially challenging due to the data’s complex nature and temporal dependencies. Too often, explainability using interpretable features leads to performance loss. We propose a novel XAI method that combines explanation spaces and concept-based explanations to explain AI predictions on time-series data. By using Inherently Interpretable Components (IICs), which encapsulate domain-specific, interpretable concepts within a custom explanation space, we preserve the performance of models trained on time series while achieving the interpretability of concept-based explanations based on extracted features. Furthermore, we define a domain-specific set of IICs for wearable-based health monitoring and demonstrate their usability in real applications, including state assessment and epileptic seizure detection.

Keywords: Explainable AI, XAI, Inherently Interpretable Components, wearable health monitoring, time-series data, concept-based explanations, epileptic seizure detection, interpretability

229. ❌ Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

Authors: Kun Wang, Reinhard Heckel Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12875v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

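Scoring rationale: The paper's core topic is LLM evaluation: it proposes a test-time RL alignment method to replace SFT-based train-before-test, which directly involves the keywords on LLMs, SFT, and RL alignment. It mentions reasoning tasks, which relates to CoT to some degree but is not central. Other keywords, such as MoE, quantization, and RAG, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes a test-time RL alignment method for evaluating LLM capability more accurately. It finds that conventional evaluation overstates the gains from SFT/RLVR because of task-familiarity bias: once base models are aligned, the performance gap narrows substantially.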

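Abstract translation

Evaluating large language models directly on benchmarks can be misleading, because comparatively strong performance may reflect familiarity with the task rather than real capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, traditionally through supervised fine-tuning. However, suitable training data is often hard to obtain, and evaluation results vary with the data chosen. This paper proposes a two-stage test-time reinforcement learning (RL) alignment method for train-before-test: first, RL with a single sample gives the model an initial alignment to the task format; second, test-time RL with a majority-voting reward aligns the model to the benchmark distribution. Our test-time RL method aligns about as well as SFT-based train-before-test, without requiring a task-specific training set. On a domain-specific benchmark that has no training data, we find that direct evaluation underestimates base models, which perform substantially better once aligned, giving a more faithful picture of their capabilities. Moreover, on reasoning tasks the performance gap between fine-tuned models and their base models largely disappears after alignment, which suggests that many of the gains from RLVR/SFT reported in the literature are artifacts of task familiarity rather than differences in reasoning capability.

Abstract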

Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides a first alignment of the model to the task format, and second, test-time RL with majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns similarly well as SFT-based train-before test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from RLVR/SFT reported in the literature are not a difference in reasoning capability, but rather artifacts of task familiarity.

关键词: LLM evaluation, test-time RL alignment, task familiarity, benchmark artifacts, reinforcement learning, supervised fine-tuning, capability assessment, reasoning tasks
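
The second stage described in the abstract relies on a majority-voting reward. A minimal sketch of how such a reward could be computed is shown below, assuming the final answers have already been extracted from several sampled responses as strings. The function name `majority_vote_rewards` and the 0/1 reward values are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled answers.

    The most frequent answer is treated as a pseudo-label; each sample
    receives reward 1.0 if it matches the pseudo-label, else 0.0.
    (Hypothetical helper for illustration only.)
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; on ties the answer
    # encountered first wins.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


samples = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

No ground-truth label appears anywhere in the computation, which is what lets this kind of reward be used on a benchmark without a training set.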

230. ❌ Surrogates for Physics-based and Data-driven Modelling of Parametric Systems: Review and New Perspectives

作者: Matteo Giacomini, Pedro Díez 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12870v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

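Scoring rationale: The paper is a review of surrogate models for parametric systems, covering physics-based and data-driven modelling, including dimensionality reduction, multi-fidelity methods, and adaptive sampling. It belongs to Scientific Machine Learning but does not specifically discuss LLMs, innovations in deep learning techniques, or applications of large models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper deals with AI methods for scientific computing and engineering, though that is not its core focus, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.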

!!! tip deepseek-chat TL;DR

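The paper reviews physics-based and data-driven surrogate models for parametric systems, covering dimensionality reduction, multi-fidelity techniques, and adaptive sampling, with the aim of supporting efficient modelling for applications such as optimisation, control, and digital twins.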

摘要翻译

代理模型在用户定义的输入参数与目标输出量之间建立了紧凑的关联关系，使得在大量查询场景中能够高效评估复杂的参数化系统。这种能力在众多应用领域中至关重要，包括优化、控制、数据同化、不确定性量化，以及制造业、个性化医疗、智慧城市和可持续性等领域的新兴数字孪生技术。本文综述了构建代理模型的成熟方法，这些方法或利用系统控制方程和动力学结构的知识（基于物理的），或利用实验观测数据（数据驱动的），以及融合这两种范式的混合方法。通过将代理模型的设计重新审视为一个函数逼近问题，本文从（i）降阶基（reduced basis）的选取和（ii）合适逼近准则的确定两个方面，对现有方法进行了梳理。本文回顾了科学机器学习领域的相关方法，旨在综合以下方面的既有知识、最新进展和新视角：基于本征正交分解（POD）、本征广义分解（PGD）和人工神经网络的降维以及基于物理的和数据驱动的代理建模；利用不同保真度来源信息的多保真度方法；以及用于提升代理模型质量的自适应采样、富集（enrichment）和数据增强技术。

摘要 (Abstract)

Surrogate models provide compact relations between user-defined input parameters and output quantities of interest, enabling the efficient evaluation of complex parametric systems in many-query settings. Such capabilities are essential in a wide range of applications, including optimisation, control, data assimilation, uncertainty quantification, and emerging digital twin technologies in various fields such as manufacturing, personalised healthcare, smart cities, and sustainability. This article reviews established methodologies for constructing surrogate models exploiting either knowledge of the governing laws and the dynamical structure of the system (physics-based) or experimental observations (data-driven), as well as hybrid approaches combining these two paradigms. By revisiting the design of a surrogate model as a functional approximation problem, existing methodologies are reviewed in terms of the choice of (i) a reduced basis and (ii) a suitable approximation criterion. The paper reviews methodologies pertaining to the field of Scientific Machine Learning, and it aims at synthesising established knowledge, recent advances, and new perspectives on: dimensionality reduction, physics-based, and data-driven surrogate modelling based on proper orthogonal decomposition, proper generalised decomposition, and artificial neural networks; multi-fidelity methods to exploit information from sources with different fidelities; adaptive sampling, enrichment, and data augmentation techniques to enhance the quality of surrogate models.

关键词: surrogate models, physics-based modelling, data-driven modelling, dimensionality reduction, proper orthogonal decomposition, artificial neural networks, multi-fidelity methods, adaptive sampling

231. ❌ On Linear Separability of the MNIST Handwritten Digits Dataset

作者: Ákos Hajnal 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12850v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

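Scoring rationale: The paper studies the linear separability of the MNIST handwritten digits dataset, which is foundational work in classical machine learning and pattern recognition. It does not touch on large models, innovations in deep learning techniques, applications of large models, or AI for Science, so every keyword receives a relevance of 0.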

!!! tip deepseek-chat TL;DR

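Through an empirical study, the paper addresses the long-disputed question of whether the MNIST handwritten digits dataset is linearly separable, systematically examining pairwise and one-vs-rest separability of the training, test, and combined sets.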

摘要翻译

MNIST数据集包含数以千计的手写数字图像，至今仍是评估各类模式识别与图像分类模型的基础基准。线性可分性是众多统计与机器学习方法中的核心概念。尽管MNIST数据集历史悠久，且其尺寸与分辨率相对简单，但关于该数据集是否线性可分的问题从未得到完整解答——科学文献与非正式来源中存在相互矛盾的说法。本文旨在通过全面的实证研究来探讨此问题，分别区分训练集、测试集及合并数据集中存在的成对线性可分性与一对多线性可分性。文章回顾了评估线性可分性的理论方法，结合前沿技术与工具，系统检验了所有相关数据组合，并报告了研究结果。

摘要 (Abstract)

The MNIST dataset containing thousands of handwritten digit images is still a fundamental benchmark for evaluating various pattern-recognition and image-classification models. Linear separability is a key concept in many statistical and machine-learning techniques. Despite the long history of the MNIST dataset and its relative simplicity in size and resolution, the question of whether the dataset is linearly separable has never been fully answered – scientific and informal sources share conflicting claims. This paper aims to provide a comprehensive empirical investigation to address this question, distinguishing pairwise and one-vs-rest separation of the training, the test and the combined sets, respectively. It reviews the theoretical approaches to assessing linear separability, alongside state-of-the-art methods and tools, then systematically examines all relevant assemblies, and reports the findings.

关键词: MNIST dataset, linear separability, handwritten digits, pattern recognition, image classification, empirical investigation, pairwise separation, one-vs-rest separation
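
To make the notion of linear separability concrete, the sketch below runs the classical perceptron rule on a toy 2D dataset. The perceptron reaches a pass with zero mistakes only if the two classes are linearly separable, so success proves separability, while hitting the epoch cap is inconclusive. This is an illustrative stand-in and not the procedure used in the paper; the function name and the `max_epochs` parameter are assumptions.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find (w, b) with label * (w.x + b) > 0 for every point.

    Returns (w, b) if a separating hyperplane is found, else None.
    None is inconclusive: the epoch cap may simply be too small.
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Standard perceptron update on a misclassified point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# AND-like labels are linearly separable; XOR labels are not.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]
print(perceptron_separates(corners, and_labels) is not None)
print(perceptron_separates(corners, xor_labels) is not None)
```

For a definitive negative answer, an exact feasibility test such as a linear program is the usual choice; the perceptron can only confirm separability.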

232. ❌ A Multi-task Large Reasoning Model for Molecular Science

作者: Pengfei Liu, Shuang Ge, Jun Tao, Zhixiang Ren 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12808v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

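Scoring rationale: The paper proposes a multi-task large reasoning model for molecular science whose core innovation is embedding structured reasoning and reflection into a deep learning architecture. Highly relevant keywords (10 points): (1) 'Chain of Thought' (the paper explicitly uses a CoT framework); (2) 'AI for Science' (applied to molecular science and drug design). Strongly relevant keywords (8 points): (1) 'Large Language Models' (an application of large models to science); (2) 'System 2 Thinking' (in-depth reasoning); (3) 'Self-Correction' (a reflection mechanism); (4) 'Explainable AI' (emphasis on interpretability). The remaining keywords are either not explicitly addressed or not central, and score 0.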

!!! tip deepseek-chat TL;DR

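To address the lack of generality and reasoning ability in existing molecular models, the study proposes a multi-task large reasoning model that combines chain-of-thought reasoning with reinforcement learning. Across 10 molecular tasks it reports an average 50.3% improvement over the base architecture, and a case study on central nervous system (CNS) drug design illustrates its learning efficiency and interpretability.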

摘要翻译

分子科学人工智能的进展正推动研究范式从纯数据驱动预测向知识引导的计算推理转变。现有分子模型多为专有系统，缺乏通用分子智能与泛化能力，这凸显了开发能有效融合科学逻辑与深度学习架构的计算方法的必要性。本文提出一种多任务大推理模型，旨在通过结构化推理与反思机制模拟分子科学家的认知过程。该方法整合了多专家模块以提供多维度分子专业知识，并构建了由分子知识增强的强化学习优化的思维链框架，从而实现结构化反思推理。在涵盖10项分子任务与47项评估指标的系统性测试中，本模型相较于基础架构平均性能提升50.3%，在训练数据与计算资源显著减少的情况下，仍超越包括超大规模参数基础模型在内的20余个前沿基线模型。这证实嵌入显式推理机制可实现高效学习，使小规模模型在效能与可解释性上超越参数量庞大的模型。通过对中枢神经系统候选药物设计的案例研究，进一步验证了该计算框架的实用价值，展示了其连接数据驱动与知识融合方法以实现智能分子设计的潜力。

摘要 (Abstract)

Advancements in artificial intelligence for molecular science are necessitating a paradigm shift from purely data-driven predictions to knowledge-guided computational reasoning. Existing molecular models are predominantly proprietary, lacking general molecular intelligence and generalizability. This underscores the necessity for computational methods that can effectively integrate scientific logic with deep learning architectures. Here we introduce a multi-task large reasoning model designed to emulate the cognitive processes of molecular scientists through structured reasoning and reflection. Our approach incorporates multi-specialist modules to provide versatile molecular expertise and a chain-of-thought (CoT) framework enhanced by reinforcement learning infused with molecular knowledge, enabling structured and reflective reasoning. Systematic evaluations across 10 molecular tasks and 47 metrics demonstrate that our model achieves an average 50.3% improvement over the base architecture, outperforming over 20 state-of-the-art baselines, including ultra-large-parameter foundation models, despite using significantly fewer training data and computational resources. This validates that embedding explicit reasoning mechanisms enables high-efficiency learning, allowing smaller-scale models to surpass massive counterparts in both efficacy and interpretability. The practical utility of this computational framework was validated through a case study on the design of central nervous system (CNS) drug candidates, illustrating its capacity to bridge data-driven and knowledge-integrated approaches for intelligent molecular design.

关键词: multi-task large reasoning model, molecular science, chain-of-thought, reinforcement learning, structured reasoning, drug design, interpretability, knowledge-guided computation

233. ❌ A Fractional Fox H-Function Kernel for Support Vector Machines: Robust Classification via Weighted Transmutation Operators

作者: Gustavo Dorrego 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12794v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

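Scoring rationale: The paper focuses on the mathematical derivation and application of a new kernel function for support vector machines (the Fox-Dorrego kernel). Its core contribution is a non-stationary Mercer kernel with fractional asymptotic power-law decay and an aging weight function, obtained from the fractional diffusion-wave equation via a structure-preserving transmutation method over weighted Sobolev spaces, with the goal of more robust classification and lower sensitivity to outliers. The content lies entirely within kernel methods, functional analysis, and numerical computation in classical machine learning. It does not involve LLMs, deep learning, large-model techniques, training or alignment methods, inference optimization, agent systems, or AI for Science topics such as bioinformatics. Since all scoring keywords concern large models, deep learning techniques, and their applications, every keyword receives a relevance of 0.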

!!! tip deepseek-chat TL;DR

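The paper targets the tendency of the Gaussian RBF kernel in SVMs to overfit under structural noise and outliers. It proposes the Fox-Dorrego kernel, derived from the fundamental solution of the generalized time-space fractional diffusion-wave equation. Experiments on synthetic datasets and real-world radar data (Ionosphere) indicate that the kernel reduces the classification error rate by approximately 50% relative to the Gaussian RBF baseline while remaining structurally robust to outliers.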

摘要翻译

支持向量机（SVM）的性能在很大程度上依赖于核函数的选择，以将数据映射到高维特征空间。虽然高斯径向基函数（Gaussian RBF）是业界标准，但其指数衰减特性使其对结构噪声和异常值极为敏感，在复杂数据集中常导致严重的过拟合。本文提出了一类新颖的非平稳核函数，其源自广义时空分数阶扩散-波动方程的基本解。通过在加权索伯列夫空间（Weighted Sobolev Spaces）上运用结构保持的变换方法，我们引入了Fox-Dorrego核，这是一种由Fox H函数控制的精确解析Mercer核。与标准核函数不同，我们的公式引入了一个老化权重函数（“遗忘效应”）以惩罚远距离异常值，并采用分数阶渐近幂律衰减来实现鲁棒的重尾特征映射（类似于莱维飞行）。在合成数据集和真实世界高维雷达数据（Ionosphere）上的数值实验表明，所提出的Fox-Dorrego核函数始终优于标准高斯RBF基线，在保持对异常值结构鲁棒性的同时，将分类错误率降低了约50%。

摘要 (Abstract)

Support Vector Machines (SVMs) rely heavily on the choice of the kernel function to map data into high-dimensional feature spaces. While the Gaussian Radial Basis Function (RBF) is the industry standard, its exponential decay makes it highly susceptible to structural noise and outliers, often leading to severe overfitting in complex datasets. In this paper, we propose a novel class of non-stationary kernels derived from the fundamental solution of the generalized time-space fractional diffusion-wave equation. By leveraging a structure-preserving transmutation method over Weighted Sobolev Spaces, we introduce the Fox-Dorrego Kernel, an exact analytical Mercer kernel governed by the Fox H-function. Unlike standard kernels, our formulation incorporates an aging weight function (the “Amnesia Effect”) to penalize distant outliers and a fractional asymptotic power-law decay to allow for robust, heavy-tailed feature mapping (analogous to Lévy flights). Numerical experiments on both synthetic datasets and real-world high-dimensional radar data (Ionosphere) demonstrate that the proposed Fox-Dorrego kernel consistently outperforms the standard Gaussian RBF baseline, reducing the classification error rate by approximately 50% while maintaining structural robustness against outliers.

关键词: Support Vector Machines, Kernel Function, Fox H-function, Fractional Diffusion-wave Equation, Weighted Sobolev Spaces, Robust Classification, Outlier Resistance, Non-stationary Kernels
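
The abstract contrasts the exponential decay of the Gaussian RBF with a power-law decay. The sketch below illustrates only that contrast, using the standard rational quadratic kernel as a heavy-tailed stand-in. It does not implement the Fox-Dorrego kernel or the Fox H-function, and the parameter names (`sigma`, `alpha`, `length`) are assumptions chosen for illustration.

```python
import math


def gaussian_rbf(r, sigma=1.0):
    """Gaussian RBF as a function of distance r: exponential decay."""
    return math.exp(-(r ** 2) / (2.0 * sigma ** 2))


def rational_quadratic(r, alpha=1.0, length=1.0):
    """Rational quadratic kernel: power-law (heavy-tailed) decay.

    Used here only as a stand-in for a heavy-tailed kernel.
    """
    return (1.0 + (r ** 2) / (2.0 * alpha * length ** 2)) ** (-alpha)


# Compare how fast each kernel's similarity falls off with distance.
for r in (0.0, 1.0, 3.0, 6.0):
    print(f"r={r:>4}: rbf={gaussian_rbf(r):.3e}  rq={rational_quadratic(r):.3e}")
```

At large distances the power-law kernel keeps a non-negligible similarity where the Gaussian value is effectively zero, which is the qualitative difference the abstract refers to as heavy-tailed feature mapping.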

234. ❌ Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks

作者: Yuki Kurumadani 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12785v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

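Scoring rationale: The paper derives upper bounds for local learning coefficients in three-layer neural networks viewed as singular learning models. This is theoretical analysis of deep learning, but it does not involve the specific techniques or applications behind any keyword (LLMs, MoE, SLMs, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, long context, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science), so every keyword receives a relevance of 0.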

!!! tip deepseek-chat TL;DR

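The paper derives an upper-bound formula for the local learning coefficient at singular points of three-layer neural networks. The formula can be read as a counting rule under budget constraints and demand-supply constraints, and it applies to general analytic activation functions. When the input dimension is one, the bound coincides with the known learning coefficient, partially resolving the discrepancy between an earlier upper bound and known values.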

摘要翻译

已知三层神经网络会形成奇异学习模型，其贝叶斯渐近行为由学习系数（即实对数典则阈值）所决定。尽管该量在正则模型及某些特殊奇异模型中已得到阐明，但在神经网络中评估该系数的普适方法仍十分有限。

近期，针对半正则模型提出了局部学习系数的计算公式，该公式给出了学习系数的上界。然而，此公式仅适用于实现参数集中的非奇点，无法应用于奇点。特别地，对于三层神经网络，该上界在某些情况下已被证明与已知的学习系数值存在显著差异。

本文推导了三层神经网络在奇点处局部学习系数的上界公式。该公式可解释为预算约束与供需约束下的计数规则，且适用于一般的解析激活函数。特别地，其涵盖了swish函数与多项式函数，从而将先前结果推广至更广泛的激活函数类别。

我们进一步证明，当输入维度为一时，本文所得上界与已知学习系数完全一致，从而部分解决了上述差异。我们的结果也为理解三层神经网络的权重参数如何影响学习系数提供了系统性的视角。

摘要 (Abstract)

Three-layer neural networks are known to form singular learning models, and their Bayesian asymptotic behavior is governed by the learning coefficient, or real log canonical threshold. Although this quantity has been clarified for regular models and for some special singular models, broadly applicable methods for evaluating it in neural networks remain limited. Recently, a formula for the local learning coefficient of semiregular models was proposed, yielding an upper bound on the learning coefficient. However, this formula applies only to nonsingular points in the set of realization parameters and cannot be used at singular points. In particular, for three-layer neural networks, the resulting upper bound has been shown to differ substantially from learning coefficient values already known in some cases. In this paper, we derive an upper-bound formula for the local learning coefficient at singular points in three-layer neural networks. This formula can be interpreted as a counting rule under budget constraints and demand-supply constraints, and is applicable to general analytic activation functions. In particular, it covers the swish function and polynomial functions, extending previous results to a wider class of activation functions. We further show that, when the input dimension is one, the upper bound obtained here coincides with the already known learning coefficient, thereby partially resolving the discrepancy above. Our result also provides a systematic perspective on how the weight parameters of three-layer neural networks affect the learning coefficient.

关键词: three-layer neural networks, learning coefficient, real log canonical threshold, singular points, upper bound, activation functions, Bayesian asymptotic behavior, semiregular models
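
For context on why the learning coefficient governs Bayesian asymptotics, the standard result from singular learning theory can be written as below. The symbols are not defined in the abstract and are assumptions here: $F_n$ is the Bayesian free energy for $n$ samples, $L_n(w_0)$ the empirical loss at an optimal parameter $w_0$, $\lambda$ the learning coefficient (real log canonical threshold), $m$ its multiplicity, and $d$ the parameter dimension.

```latex
% Asymptotic expansion of the Bayesian free energy
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)

% Regular models attain \lambda = d/2 with m = 1,
% while singular models satisfy \lambda \le d/2.
\lambda_{\text{regular}} = \frac{d}{2}, \qquad \lambda_{\text{singular}} \le \frac{d}{2}
```

An upper bound on $\lambda$, such as the one the paper derives, therefore bounds the $\log n$ term of the free energy from above.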

235. ❌ VecMol: Vector-Field Representations for 3D Molecule Generation

作者: Yuchen Hua, Xingang Peng, Jianzhu Ma, Muhan Zhang 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12734v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

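Scoring rationale: The paper presents a new deep learning method for 3D molecule generation and is unrelated to most keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', which receives 10 points because the work applies AI to science (drug discovery, materials science) and cheminformatics.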

!!! tip deepseek-chat TL;DR

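The paper proposes VecMol, a framework that generates molecules by modelling 3D molecules as continuous vector fields. This sidesteps the modality entanglement and geometry-chemistry coherence constraints that existing methods face, and experiments on the QM9 and GEOM-Drugs benchmarks validate its feasibility.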

摘要翻译

三维分子生成建模是药物发现与材料科学领域基础且具有挑战性的课题。现有方法通常将分子表示为三维图，并同时生成离散的原子类型与连续的原子坐标，这导致了固有的学习困难，例如异质模态纠缠与几何-化学一致性约束。我们提出VecMol，这是一个范式转换框架，它通过将三维分子建模为欧几里得空间上的连续向量场来重新构想分子表示，其中向量指向邻近原子并隐式编码分子结构。该向量场由神经场参数化，并利用潜在扩散模型生成，从而避免了显式图生成，并将结构学习与离散原子实例化解耦。在QM9和GEOM-Drugs基准测试上的实验验证了这一新方法的可行性，表明基于向量场的表示法为三维分子生成提供了一个前景广阔的新方向。

摘要 (Abstract)

Generative modeling of three-dimensional (3D) molecules is a fundamental yet challenging problem in drug discovery and materials science. Existing approaches typically represent molecules as 3D graphs and co-generate discrete atom types with continuous atomic coordinates, leading to intrinsic learning difficulties such as heterogeneous modality entanglement and geometry-chemistry coherence constraints. We propose VecMol, a paradigm-shifting framework that reimagines molecular representation by modeling 3D molecules as continuous vector fields over Euclidean space, where vectors point toward nearby atoms and implicitly encode molecular structure. The vector field is parameterized by a neural field and generated using a latent diffusion model, avoiding explicit graph generation and decoupling structure learning from discrete atom instantiation. Experiments on the QM9 and GEOM-Drugs benchmarks validate the feasibility of this novel approach, suggesting vector-field-based representations as a promising new direction for 3D molecular generation.

关键词: 3D molecule generation, vector field representation, neural field, latent diffusion model, drug discovery, materials science, generative modeling, molecular representation
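
The abstract describes a field whose vectors point toward nearby atoms. The toy sketch below shows one simple way such a field could be evaluated at a query point, by returning the displacement to the nearest atom centre. The exact field definition in VecMol is not given in the abstract, so this nearest-atom rule and the name `toward_nearest_atom` are assumptions for illustration only.

```python
import math


def toward_nearest_atom(query, atoms):
    """Vector from `query` to the closest atom centre (toy field).

    `query` is an (x, y, z) tuple and `atoms` a list of such tuples.
    This is a hypothetical field definition, not the one from VecMol.
    """
    nearest = min(atoms, key=lambda a: math.dist(a, query))
    return tuple(a - q for a, q in zip(nearest, query))


# Two atoms on the x-axis; the query point lies closer to the second.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(toward_nearest_atom((1.0, 0.0, 0.0), atoms))
```

Atom positions are implicit in such a field as the points where the vectors converge, which is the sense in which the representation encodes structure without an explicit graph.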

236. ❌ Taming the Long Tail: Efficient Item-wise Sharpness-Aware Minimization for LLM-based Recommender Systems

作者: Jiaming Zhang, Yuyuan Li, Xiaohua Feng, Li Zhang, Longfei Li, Jun Zhou, Chaochao Chen 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12752v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

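Scoring rationale: The paper concerns LLMs applied to recommender systems; its core contribution is the EISAM optimization framework for the long-tail problem. Because it explicitly builds on LLMs as backbones, it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific techniques behind other keywords (such as MoE, SFT, RAG, or quantization) or scientific applications (such as bioinformatics), so all other keywords score 0.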

!!! tip deepseek-chat TL;DR

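The paper studies the long-tail problem in LLM-based recommender systems and proposes an efficient item-wise sharpness-aware minimization framework (EISAM) that significantly improves recommendation performance on tail items while preserving overall quality.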

摘要翻译

基于大语言模型的推荐系统（LRSs）作为一种新兴的序列推荐范式，近期通过直接采用大语言模型作为主干架构而出现。尽管LRSs展现出强大的知识利用与指令遵循能力，其在长期存在的长尾问题下的表现尚未得到系统研究。本文通过实证研究发现，LRSs面临两种不同类型的长尾现象：i) 先验长尾，即隐式继承自预训练语料库的长尾分布；ii) 数据长尾，源于推荐数据集本身的偏斜分布。我们的分析表明，这两种长尾均导致头部与尾部物品之间的性能差异，且两者的交集会引发更显著的头部效应。然而，LRSs的整体性能分布，尤其在尾部区域，仍主要受数据长尾主导。为应对这一挑战，我们提出高效逐项锐度感知最小化（Efficient Item-wise Sharpness-Aware Minimization, EISAM），这是一种新颖的优化框架，通过在物品层级自适应地正则化损失函数的几何形态，以提升尾部物品的推荐性能。EISAM设计了一种高效的惩罚项，能够捕捉细粒度的物品特定锐度，同时保持对大语言模型的计算可扩展性。此外，我们推导了EISAM的泛化界。理论分析表明，在逐项正则化下，该泛化界以更快的速率下降，为其有效性提供了理论支撑。在三个真实世界数据集上的大量实验证明，EISAM在保持整体推荐质量的同时，显著提升了尾部物品的推荐性能，从而为LRSs中的长尾问题建立了首个系统性解决方案。

摘要 (Abstract)

Large Language Model-based Recommender Systems (LRSs) have recently emerged as a new paradigm in sequential recommendation by directly adopting LLMs as backbones. While LRSs demonstrate strong knowledge utilization and instruction-following abilities, they have not been systematically studied under the long-standing long-tail problem. In this paper, we conduct an empirical study and reveal that LRSs face two distinct types of long-tail: i) prior long-tail, inherited implicitly from pretraining corpora, and ii) data long-tail, originating from skewed recommendation datasets. Our analysis shows that both contribute to the performance disparity between head and tail items, with the intersection of the two heads exhibiting an even stronger head effect. Nevertheless, the overall performance distribution in LRSs, especially on the tail, remains dominated by the data long-tail. To address this challenge, we propose Efficient Item-wise Sharpness-Aware Minimization (EISAM), a novel optimization framework that improves tail-item performance by adaptively regularizing the loss landscape at the item level. EISAM introduces an efficient penalty design that captures fine-grained item-specific sharpness while maintaining computational scalability for LLMs. In addition, we derive a generalization bound for EISAM. Our theoretical analysis shows that the bound decreases at a faster rate under our item-wise regularization, offering theoretical support for its effectiveness. Extensive experiments on three real-world datasets demonstrate that EISAM significantly boosts tail-item recommendation performance while preserving overall quality, establishing the first systematic solution to the long-tail problem in LRSs.

关键词: Large Language Models, Recommender Systems, Long-tail Problem, Sharpness-Aware Minimization, Item-wise Regularization, Optimization Framework, Tail-item Performance
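
EISAM builds on sharpness-aware minimization (SAM). As background, the sketch below shows one standard SAM update on a toy quadratic loss: the weights are first perturbed along the normalized gradient by a radius `rho`, and the gradient at that perturbed point is then applied to the original weights. This is plain SAM, not the item-wise EISAM penalty, whose details are not specified in the abstract. The toy loss and all parameter values are assumptions.

```python
import math


def loss(w):
    """Toy loss L(w) = 0.5 * (10*w0^2 + w1^2), sharp along w0."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def loss_grad(w):
    """Analytic gradient of the toy loss."""
    return [10.0 * w[0], 1.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.1):
    """One SAM update: ascend to a nearby worst-case point, then descend
    from the original weights using the gradient taken at that point."""
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w)
print(w, loss(w))
```

Each step costs two gradient evaluations instead of one, which is why the abstract stresses an efficient penalty design when scaling the idea to LLM backbones.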

237. ❌ Anchored Alignment: Preventing Positional Collapse in Multimodal Recommender Systems

作者: Yonghun Jeong, David Yoon Suk Kang, Yeon-Chang Lee 期刊/来源: arxiv 发布日期: 2026-03-13 arXiv链接: http://arxiv.org/abs/2603.12726v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

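Scoring rationale: The paper studies alignment in multimodal recommender systems and proposes the AnchorRec framework, which uses anchor-based alignment to avoid positional collapse. It relates only to the keyword 'Instruction Tuning OR Alignment OR Value Alignment' (8 points), because it involves multimodal alignment techniques, though not LLM alignment. All other keywords are unrelated to the paper (0 points); the paper does not cover large models, the principles of deep learning techniques, or scientific applications.

!!! tip deepseek-chat TL;DR

To address the blurred modality structure and ID dominance caused by direct alignment in multimodal recommender systems, the paper proposes AnchorRec, a framework that preserves modality-specific structure through anchor-based alignment in a lightweight projection domain. On four Amazon datasets it achieves competitive recommendation accuracy and improves multimodal expressiveness and coherence.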

Abstract

Multimodal recommender systems (MMRS) leverage images, text, and interaction signals to enrich item representations. However, recent alignment-based MMRSs that enforce a unified embedding space often blur modality-specific structures and exacerbate ID dominance. Therefore, we propose AnchorRec, a multimodal recommendation framework that performs indirect, anchor-based alignment in a lightweight projection domain. By decoupling alignment from representation learning, AnchorRec preserves each modality’s native structure while maintaining cross-modal consistency and avoiding positional collapse. Experiments on four Amazon datasets show that AnchorRec achieves competitive top-N recommendation accuracy, while qualitative analyses demonstrate improved multimodal expressiveness and coherence. The codebase of AnchorRec is available at https://github.com/hun9008/AnchorRec.

Keywords: multimodal recommender systems, alignment, anchor-based alignment, positional collapse, modality-specific structures, cross-modal consistency, recommendation accuracy, embedding space

238. ❌ SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design

Authors: David van Dijk, Ivan Vrkic Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12724v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

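Scoring rationale: The paper's core is evaluating and improving language models on scientific inverse design tasks; it builds a benchmark spanning 14 scientific domains and proposes the RLSF training recipe. It is therefore highly relevant to 'Large Language Models' (10 points), because it directly tests and tunes LLMs (e.g., Sonnet 4.5, Opus 4.6), and highly relevant to 'AI for Science' (10 points), because it focuses on inverse design problems in scientific domains, touching on bioinformatics, cheminformatics, and related areas. Other keywords, such as MoE, SLMs, Scaling Laws, training methods (SFT, RLHF, PEFT), inference techniques (RAG, CoT, attention optimization), agent systems, and model compression, are either not covered or only implicit (for example, simulator feedback may involve reasoning, but it is not central), so they score 0.

!!! tip deepseek-chat TL;DR

The paper targets scientific inverse design, introducing the SciDesignBench benchmark and the RLSF training recipe; an RLSF-tuned 8B model raises single-turn success rates by 8-17 percentage points across three domains.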

Abstract

Many of the most important problems in science and engineering are inverse problems: given a desired outcome, find a design that achieves it. Evaluating whether a candidate meets the spec is often routine; a binding energy can be computed, a reactor yield simulated, a pharmacokinetic profile predicted. But searching a combinatorial design space for inputs that satisfy those targets is fundamentally harder. We introduce SciDesignBench, a benchmark of 520 simulator-grounded tasks across 14 scientific domains and five settings spanning single-shot design, short-horizon feedback, long-horizon refinement, and seed-design optimization. On the 10-domain shared-core subset, the best zero-shot model reaches only 29.0% success despite substantially higher parse rates. Simulator feedback helps, but the leaderboard changes with horizon: Sonnet 4.5 is strongest in one-turn de novo design, whereas Opus 4.6 is strongest after 20 turns of simulator-grounded refinement. Providing a starting seed design reshuffles the leaderboard again, demonstrating that constrained modification requires a fundamentally different capability from unconstrained de novo generation. We then introduce RLSF, a simulator-feedback training recipe. An RLSF-tuned 8B model raises single-turn success rates by 8-17 percentage points across three domains. Together, these results position simulator-grounded inverse design as both a benchmark for scientific reasoning and a practical substrate for amortizing expensive test-time compute into model weights.

Keywords: scientific inverse design, language models, benchmark, simulator-grounded tasks, RLSF training, cross-domain evaluation, feedback optimization, design space search
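The multi-turn, simulator-grounded refinement setting described in the abstract can be pictured with a toy loop. Everything in this sketch is hypothetical: `simulator` is a made-up scoring function, and `propose` uses random perturbation as a stand-in for a language model that would revise a design from simulator feedback.

```python
import random

def simulator(design):
    """Hypothetical simulator: higher score means the design is closer to a target spec."""
    target = [0.3, -1.2, 2.0]  # made-up target values
    return -sum((d - t) ** 2 for d, t in zip(design, target))

def propose(best, step, rng):
    # Stand-in for a model proposing a revised design; here a random perturbation
    return [d + rng.gauss(0.0, step) for d in best]

def refine(seed_design, turns=20, seed=0):
    """Keep the best design seen over a fixed number of feedback turns."""
    rng = random.Random(seed)
    best, best_score = list(seed_design), simulator(seed_design)
    for _ in range(turns):
        candidate = propose(best, 0.5, rng)
        score = simulator(candidate)
        if score > best_score:  # accept only improvements reported by the simulator
            best, best_score = candidate, score
    return best, best_score

best, score = refine([0.0, 0.0, 0.0])
print(score >= simulator([0.0, 0.0, 0.0]))  # True: the loop never accepts a worse design
```

Passing a non-trivial `seed_design` corresponds loosely to the seed-design setting, while starting from an arbitrary point corresponds to de novo design.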

239. ❌ RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models

Authors: Zhenkun Shi, Jun Zhu, Dehang Wang, BoYu Chen, Qianqian Yuan, Zhitao Mao, Fan Wei, Weining Wu, Xiaoping Liao, Hongwu Ma Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

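Scoring rationale: RXNRECer focuses on a bioinformatics application, enzyme function annotation, which belongs to AI for Science, so it is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper uses protein language models (a domain-specific kind of large language model), so it is related to the keyword 'Large Language Models OR LLMs OR Foundation Models' (8 points). The abstract mentions 'interpretable rationales for predictions', which relates to explainable AI, so 'Mechanistic Interpretability OR Explainable AI' receives 5 points. Other keywords, such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, and agent systems, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes RXNRECer, a framework that combines protein language models with active learning to predict enzyme-catalyzed reactions directly, avoiding the ambiguity of traditional EC-number-based methods and improving F1 score and accuracy by 16.54% and 15.43%, respectively.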

Abstract

A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many-to-many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer-based ensemble framework that directly predicts enzyme-catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. Evaluations on curated cross-validation and temporal test sets demonstrate consistent improvements over six EC-based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome-wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions. These capabilities make RXNRECer a robust and versatile solution for EC-free, fine-grained enzyme function prediction, with potential applications across multiple areas of enzyme research and industrial applications.

Keywords: enzyme annotation, protein language models, active learning, reaction prediction, EC-free prediction, bioinformatics, transformer-based ensemble, interpretable predictions

240. ❌ Colluding LoRA: A Composite Attack on LLM Safety Alignment

Authors: Sihao Ding Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

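Scoring rationale: The paper's core is a composition attack on LLM safety alignment via LoRA (parameter-efficient fine-tuning). It is therefore highly relevant to 'PEFT/LoRA' (15 points), directly concerns LLM safety alignment (10 points), and is grounded in LLM research (10 points). Other keywords such as MoE, SLMs, Scaling Laws, and RAG are not covered in the paper, so they score 0.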
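For readers checking the arithmetic behind the score line, here is a minimal sketch of how the total and the pass mark might be computed. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report header; the rule that each row's score equals weight × relevance is an assumption inferred from the table columns, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class KeywordScore:
    keyword: str
    weight: float     # per-keyword weight as printed in the table
    relevance: float  # relevance value as printed in the table

def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def total_score(rows) -> float:
    # Assumption: a row's score is weight * relevance, summed over all rows
    return sum(r.weight * r.relevance for r in rows)

# The three non-zero relevance rows of this table, with the weights as printed (0.0)
rows_as_printed = [
    KeywordScore("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    KeywordScore("Instruction Tuning OR Alignment OR Value Alignment", 0.0, 10.0),
    KeywordScore("PEFT OR LoRA OR Parameter-efficient Fine-tuning", 0.0, 15.0),
]

print(round(pass_threshold(27.0), 1))  # 26.6
print(total_score(rows_as_printed))    # 0.0
```

Under this assumed rule, any row with a weight of 0.0 contributes nothing, regardless of its relevance value.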

!!! tip deepseek-chat TL;DR

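The paper proposes Colluding LoRA (CoLoRA), a composition attack in which linearly combining several seemingly benign LoRA adapters effectively breaks an LLM's safety alignment without relying on adversarial prompts, causing the model to comply with harmful requests.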

Abstract

We introduce Colluding LoRA (CoLoRA), an attack in which each adapter appears benign and plausibly functional in isolation, yet their linear composition consistently compromises safety. Unlike attacks that depend on specific input triggers or prompt patterns, CoLoRA is a composition-triggered broad refusal suppression: once a particular set of adapters is loaded, the model undergoes effective alignment degradation, complying with harmful requests without requiring adversarial prompts or suffixes. This attack exploits the combinatorial blindness of current defense systems, where exhaustively scanning all compositions is computationally intractable. Across several open-weight LLMs, CoLoRA achieves benign behavior individually yet high attack success rate after composition, indicating that securing modular LLM supply-chains requires moving beyond single-module verification toward composition-aware defenses.

Keywords: Colluding LoRA, LoRA, LLM safety alignment, adversarial attack, parameter-efficient fine-tuning, composition-triggered attack, modular LLM supply-chains, alignment degradation
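Two notions in the abstract lend themselves to a small numeric illustration: the linear composition of LoRA adapters, and why scanning every composition scales poorly. The sketch below uses random matrices and hypothetical sizes; it shows only the standard merge arithmetic and a subset count, not the paper's attack or any defense.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hypothetical hidden size and LoRA rank

W = rng.normal(size=(d, d))  # frozen base weight matrix
# Each adapter is a low-rank pair (B, A) contributing the update B @ A
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(3)]

def compose(W, adapters, scales=None):
    """Linear composition of adapters: W + sum_i s_i * (B_i @ A_i)."""
    scales = scales or [1.0] * len(adapters)
    delta = sum(s * (B @ A) for s, (B, A) in zip(scales, adapters))
    return W + delta

def n_compositions(n):
    """Number of adapter subsets with at least two members: 2^n - n - 1."""
    return 2 ** n - n - 1

W_merged = compose(W, adapters)
print(W_merged.shape)      # (8, 8)
print(n_compositions(3))   # 4
print(n_compositions(20))  # 1048555
```

The count grows exponentially with the number of available adapters, which is consistent with the abstract's remark that exhaustive scanning of compositions is computationally intractable.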

241. ❌ Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs

Authors: Zhangyong Liang, Ji Zhang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

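Scoring rationale: The paper focuses on solving parameterized partial differential equations (PDEs) with neural networks and proposes DLDMF, a physics-informed framework that decouples space, time, and parameters to improve generalization and temporal extrapolation. All of the keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or specific AI applications (such as bioinformatics), whereas the core of this paper is PDE solving in scientific computing, which falls under AI for Science in a broad sense. It therefore relates only moderately to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), and no other keyword applies (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes DLDMF, a physics-informed framework that decouples space, time, and parameters to improve the generalization and temporal extrapolation of neural networks for parameterized PDEs; experiments show it outperforms existing methods on several benchmark problems.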

Abstract

Generalizing neural surrogate models across different PDE parameters remains difficult because changes in PDE coefficients often make learning harder and optimization less stable. The problem becomes even more severe when the model must also predict beyond the training time range. Existing methods usually cannot handle parameter generalization and temporal extrapolation at the same time. Standard parameterized models treat time as just another input and therefore fail to capture intrinsic dynamics, while recent continuous-time latent methods often rely on expensive test-time auto-decoding for each instance, which is inefficient and can disrupt continuity across the parameterized solution space. To address this, we propose Disentangled Latent Dynamics Manifold Fusion (DLDMF), a physics-informed framework that explicitly separates space, time, and parameters. Instead of unstable auto-decoding, DLDMF maps PDE parameters directly to a continuous latent embedding through a feed-forward network. This embedding initializes and conditions a latent state whose evolution is governed by a parameter-conditioned Neural ODE. We further introduce a dynamic manifold fusion mechanism that uses a shared decoder to combine spatial coordinates, parameter embeddings, and time-evolving latent states to reconstruct the corresponding spatiotemporal solution. By modeling prediction as latent dynamic evolution rather than static coordinate fitting, DLDMF reduces interference between parameter variation and temporal evolution while preserving a smooth and coherent solution manifold. As a result, it performs well on unseen parameter settings and in long-term temporal extrapolation. Experiments on several benchmark problems show that DLDMF consistently outperforms state-of-the-art baselines in accuracy, parameter generalization, and extrapolation robustness.

Keywords: parameterized PDEs, neural surrogate models, latent dynamics, physics-informed framework, temporal extrapolation, parameter generalization, Neural ODE, solution manifold
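The data flow described in the abstract (parameters to embedding, embedding to initial latent state, parameter-conditioned latent dynamics, shared decoder) can be sketched as an untrained toy forward pass. All layer sizes, the random weights, and the explicit Euler integrator are assumptions made for illustration; the paper uses a Neural ODE and a physics-informed training objective, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes):
    # Random, untrained weights for a small fully connected network
    return [(rng.normal(scale=0.3, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

P, E, Z = 2, 4, 8  # hypothetical sizes: PDE parameters, embedding, latent state
encoder = mlp_params([P, 16, E])          # PDE parameters -> embedding (feed-forward)
init_net = mlp_params([E, 16, Z])         # embedding -> initial latent state
dynamics = mlp_params([Z + E, 16, Z])     # parameter-conditioned latent vector field
decoder = mlp_params([1 + E + Z, 16, 1])  # (x, embedding, z(t)) -> u(x, t)

def predict(mu, xs, ts, dt=0.01):
    """Evaluate the toy surrogate at 1D points xs and increasing times ts."""
    e = mlp(encoder, mu)
    z = mlp(init_net, e)
    t, out = 0.0, []
    for t_target in ts:
        while t < t_target - 1e-12:
            h = min(dt, t_target - t)
            z = z + h * mlp(dynamics, np.concatenate([z, e]))  # explicit Euler step
            t += h
        n = len(xs)
        feats = np.concatenate(
            [xs[:, None], np.tile(e, (n, 1)), np.tile(z, (n, 1))], axis=1)
        out.append(mlp(decoder, feats)[:, 0])
    return np.stack(out)  # shape (len(ts), len(xs))

u = predict(np.array([0.5, 1.0]), np.linspace(0.0, 1.0, 5), [0.0, 0.1, 0.2])
print(u.shape)  # (3, 5)
```

Because time enters only through the evolution of `z`, querying a later time means integrating further rather than feeding a larger time coordinate to the decoder, which is the distinction the abstract draws against static coordinate fitting.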

242. ❌ Sobolev–Ricci Curvature

Authors: Kyoichi Iwasaki, Tam Le, Hideitsu Hino Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

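Scoring rationale: The paper focuses on pure mathematics (differential geometry, graph theory, optimal transport theory), developing the theory of Sobolev-Ricci curvature and its application to graph transformation. It does not touch on large models, deep learning, AI techniques, or scientific AI applications, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes Sobolev-Ricci curvature (SRC) as a new definition of graph Ricci curvature built on Sobolev transport geometry, and demonstrates its use in Ricci-flow-style graph reweighting and in edge pruning that preserves manifold structure.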

Abstract

Ricci curvature is a fundamental concept in differential geometry for encoding local geometric structure, and its graph-based analogues have recently gained prominence as practical tools for reweighting, pruning, and reshaping network geometry. We propose Sobolev-Ricci Curvature (SRC), a graph Ricci curvature canonically induced by Sobolev transport geometry, which admits efficient evaluation via a tree-metric Sobolev structure on neighborhood measures. We establish two consistency behaviors that anchor SRC to classical transport curvature: (i) on trees endowed with the length measure, SRC recovers Ollivier-Ricci curvature (ORC) in the canonical W1 setting, and (ii) SRC vanishes in the Dirac limit, matching the flat case of measure-theoretic Ricci curvature. We demonstrate SRC as a reusable curvature primitive in two representative pipelines. We define Sobolev-Ricci Flow by replacing ORC with SRC in a Ricci-flow-style reweighting rule, and we use SRC for curvature-guided edge pruning aimed at preserving manifold structure. Overall, SRC provides a transport-based foundation for scalable curvature-driven graph transformation and manifold-oriented pruning.

Keywords: Sobolev-Ricci curvature, graph Ricci curvature, Sobolev transport geometry, Ollivier-Ricci curvature, Ricci flow, graph transformation, edge pruning, manifold structure
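The abstract states that on trees SRC recovers Ollivier-Ricci curvature (ORC) in the canonical W1 setting. The sketch below computes ORC on a weighted tree using the closed-form tree Wasserstein distance, W1(μ, ν) = Σ_e w_e · |μ(subtree below e) − ν(subtree below e)|, and κ(x, y) = 1 − W1(m_x, m_y) / d(x, y). It does not implement SRC itself, whose general definition is not given in the abstract, and the choice of a uniform measure over neighbors is an assumption.

```python
from collections import defaultdict

def build(edges):
    """Adjacency lists for an undirected weighted tree given (u, v, weight) triples."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj

def tree_w1(adj, mu, nu, root):
    """W1 on a tree: sum over edges of weight * |net mass in the subtree below the edge|."""
    parent, order, stack = {root: None}, [], [root]
    while stack:  # preorder traversal, so every node appears after its parent
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = (u, w)
                stack.append(v)
    diff, total = defaultdict(float), 0.0
    for u in reversed(order):  # children are processed before their parents
        diff[u] += mu.get(u, 0.0) - nu.get(u, 0.0)
        if parent[u] is not None:
            p, w = parent[u]
            total += w * abs(diff[u])
            diff[p] += diff[u]
    return total

def uniform_neighbor_measure(adj, x):
    # Assumption: the neighborhood measure is uniform over the neighbors of x
    nbrs = [v for v, _ in adj[x]]
    return {v: 1.0 / len(nbrs) for v in nbrs}

def orc_tree_edge(adj, x, y):
    """Ollivier-Ricci curvature of the tree edge (x, y)."""
    w_xy = next(w for v, w in adj[x] if v == y)
    w1 = tree_w1(adj, uniform_neighbor_measure(adj, x),
                 uniform_neighbor_measure(adj, y), root=x)
    return 1.0 - w1 / w_xy

path = build([(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)])
tree = build([("u", "v", 1.0), ("u", "a", 1.0), ("u", "b", 1.0),
              ("v", "c", 1.0), ("v", "d", 1.0)])
print(orc_tree_edge(path, 1, 2))      # 0.0 for an interior edge of a path
print(orc_tree_edge(tree, "u", "v"))  # about -2/3 for an edge between two degree-3 nodes
```

The negative value on the branching edge reflects the tree-like, spreading geometry around it, which is the kind of signal curvature-guided reweighting and pruning rules act on.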

243. ❌ Weakly Time-Coupled Approximation of Markov Decision Processes

Authors: Negar Soheili, Selvaprabu Nadarajah, Bo Yang Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

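Scoring rationale: The paper studies approximation methods for finite-horizon Markov decision processes (MDPs). It belongs to operations research, financial engineering, and stochastic optimization, and focuses on algorithm design and computational complexity analysis. All of the scoring keywords concern large models, deep learning, and related techniques (training methods, inference optimization, alignment, applications, and so on), none of which this paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

For finite-horizon Markov decision processes with high-dimensional exogenous uncertainty and endogenous states, the paper proposes a weakly time-coupled approximation that reduces cross-stage dependence so that computational complexity is independent of the horizon. On Bermudan option and ethanol production instances it yields tighter upper bounds than PO and LSM, along with near-optimal policies at longer horizons.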

Abstract

Finite-horizon Markov decision processes (MDPs) with high-dimensional exogenous uncertainty and endogenous states arise in operations and finance, including the valuation and exercise of Bermudan and real options, but face a scalability barrier as computational complexity grows with the horizon. A common approximation represents the value function using basis functions, but methods for fitting weights treat cross-stage optimization differently. Least squares Monte Carlo (LSM) fits weights via backward recursion and regression, avoiding joint optimization but accumulating error over the horizon. Approximate linear programming (ALP) and pathwise optimization (PO) jointly fit weights to produce upper bounds, but temporal coupling causes computational complexity to grow with the horizon. We show this coupling is an artifact of the approximation architecture, and develop a weakly time-coupled approximation (WTCA) where cross-stage dependence is independent of horizon. For any fixed basis function set, the WTCA upper bound is tighter than that of ALP and looser than that of PO, and converges to the optimal policy value as the basis family expands. We extend parallel deterministic block coordinate descent to the stochastic MDP setting exploiting weak temporal coupling. Applied to WTCA, weak coupling yields computational complexity independent of the horizon. Within equal time budget, solving WTCA accommodates more exogenous samples or basis functions than PO, yielding tighter bounds despite PO being tighter for fixed samples and basis functions. On Bermudan option and ethanol production instances, WTCA produces tighter upper bounds than PO and LSM in every instance tested, with near-optimal policies at longer horizons.

Keywords: Markov decision processes, weakly time-coupled approximation, computational complexity, upper bounds, Bermudan options, real options, basis functions, stochastic optimization
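Of the methods named in the abstract, least squares Monte Carlo (LSM) is the easiest to show compactly: weights on basis functions are fitted by backward recursion and regression. The sketch below prices a Bermudan put this way under geometric Brownian motion. It illustrates only the LSM baseline, not WTCA, ALP, or PO; the market parameters, the quadratic basis, and the function name are illustrative assumptions.

```python
import numpy as np

def lsm_bermudan_put(s0=100.0, strike=100.0, r=0.05, sigma=0.2, T=1.0,
                     n_steps=10, n_paths=20000, seed=0):
    """Estimate a Bermudan put value with backward regression on a quadratic basis."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    # Simulated prices at the exercise dates t_1, ..., t_n
    log_paths = np.cumsum((r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z, axis=1)
    S = s0 * np.exp(log_paths)
    disc = np.exp(-r * dt)

    cash = np.maximum(strike - S[:, -1], 0.0)  # payoff if held to maturity
    for t in range(n_steps - 2, -1, -1):
        cash *= disc                            # discount future cash flows to date t
        payoff = np.maximum(strike - S[:, t], 0.0)
        itm = payoff > 0                        # regress on in-the-money paths only
        if itm.sum() > 3:
            x = S[itm, t] / strike
            basis = np.column_stack([np.ones_like(x), x, x ** 2])
            coef, *_ = np.linalg.lstsq(basis, cash[itm], rcond=None)
            continuation = basis @ coef         # fitted continuation value
            exercise = payoff[itm] > continuation
            idx = np.where(itm)[0][exercise]
            cash[idx] = payoff[idx]             # exercise now replaces later cash flows
    return disc * cash.mean()                   # discount from t_1 back to time 0

print(lsm_bermudan_put())
```

Each date's regression depends on the decisions already fixed at later dates, which is the sense in which the abstract says LSM avoids joint optimization but accumulates error over the horizon.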

244. ❌ Adaptive Diffusion Posterior Sampling for Data and Model Fusion of Complex Nonlinear Dynamical Systems

Authors: Dibyajyoti Chakraborty, Hojin Kim, Romit Maulik Journal/Source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12635v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

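Scoring rationale: The paper focuses on using diffusion models for surrogate modeling, forecasting, adaptive sensor placement, and data assimilation in chaotic nonlinear dynamical systems, which is an AI for Science application. It therefore has only moderate relevance to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) and is unrelated to all other keywords (0 points), which mainly concern large language model techniques, training methods, inference optimization, and agent systems.

!!! tip deepseek-chat TL;DR

The study proposes a unified framework based on a multi-step autoregressive diffusion model for probabilistic forecasting, adaptive sensor placement, and data assimilation in high-dimensional chaotic nonlinear dynamical systems, and demonstrates it on two-dimensional turbulence and flow over a backward-facing step.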

Abstract

High-fidelity numerical simulations of chaotic, high dimensional nonlinear dynamical systems are computationally expensive, necessitating the development of efficient surrogate models. Most surrogate models for such systems are deterministic, for example when neural operators are involved. However, deterministic models often fail to capture the intrinsic distributional uncertainty of chaotic systems. This work presents a surrogate modeling formulation that leverages generative machine learning, where a deep learning diffusion model is used to probabilistically forecast turbulent flows over long horizons. We introduce a multi-step autoregressive diffusion objective that significantly enhances long-rollout stability compared to standard single-step training. To handle complex, unstructured geometries, we utilize a multi-scale graph transformer architecture incorporating diffusion preconditioning and voxel-grid pooling. More importantly, our modeling framework provides a unified platform that also predicts spatiotemporally important locations for sensor placement, either via uncertainty estimates or through an error-estimation module. Finally, the observations of the ground truth state at these dynamically varying sensor locations are assimilated using diffusion posterior sampling requiring no retraining of the surrogate model. We present our methodology on two-dimensional homogeneous and isotropic turbulence and for a flow over a backwards-facing step, demonstrating its utility in forecasting, adaptive sensor placement, and data assimilation for high dimensional chaotic systems.

Keywords: diffusion models, surrogate modeling, chaotic dynamical systems, data assimilation, adaptive sensor placement, turbulent flows, graph transformer, posterior sampling

Authors: Chenkai Ma, Keqin Chen, Jonathan Scarlett Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

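Scoring rationale: The paper studies the batched kernelized bandits problem, which belongs to classical machine learning optimization and focuses on theoretical analysis and algorithmic improvements. It does not involve large models, deep learning, AI for Science, or any of the techniques named in the scoring keywords. All keywords concern large models, deep learning principles, or scientific applications, whereas this paper is about kernelized bandit optimization, so there is no direct connection.

!!! tip deepseek-chat TL;DR

This paper studies the batched kernelized bandits problem and improves existing regret bounds: it finds the optimal number of batches, removes a factor of $B$ from the regret bound, and gives new lower bounds and algorithms for the adaptive-batch and robust settings.

Abstract translation

This paper studies black-box optimization with noisy feedback revealed in batches, where the unknown function to be optimized has bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We call this the Batched Kernelized Bandits problem, and we refine and extend existing regret-bound results. For algorithmic upper bounds, (Li and Scarlett, 2022) showed that $B=O(\log\log T)$ batches suffice for near-optimal regret, where $T$ is the time horizon and $B$ is the number of batches. We refine this result by (i) finding the optimal number of batches including constant factors (to within $1+o(1)$), and (ii) removing a factor of $B$ from the regret bound. For algorithm-independent lower bounds, we note that existing results apply only when batch sizes are fixed in advance; we give new lower bounds for adaptively chosen batch sizes and show that the minimax regret scaling with adaptive batches is essentially the same as with fixed batches. We also consider a robust setting in which the goal is to choose points whose function value remains high even after an adversarial perturbation. We propose the robust-BPE algorithm, show that a suitably defined notion of cumulative regret has the same bound as in the non-robust setting, and derive a simple regret bound significantly below that of previous work.

Abstract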

In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We refer to this as the Batched Kernelized Bandits problem, and refine and extend existing results on regret bounds. For algorithmic upper bounds, (Li and Scarlett, 2022) shows that $B=O(\log\log T)$ batches suffice to attain near-optimal regret, where $T$ is the time horizon and $B$ is the number of batches. We further refine this by (i) finding the optimal number of batches including constant factors (to within $1+o(1)$), and (ii) removing a factor of $B$ in the regret bound. For algorithm-independent lower bounds, noticing that existing results only apply when the batch sizes are fixed in advance, we present novel lower bounds when the batch sizes are chosen adaptively, and show that adaptive batches have essentially the same minimax regret scaling as fixed batches. Furthermore, we consider a robust setting where the goal is to choose points for which the function value remains high even after an adversarial perturbation. We present the robust-BPE algorithm, show that a suitably defined cumulative regret notion incurs the same bound as the non-robust setting, and derive a simple regret bound significantly below that of previous work.
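To give some intuition for why on the order of $\log\log T$ batches can be enough, the sketch below builds a batch schedule in which each batch is roughly the geometric mean of the horizon and the previous batch, so the $i$-th batch has about $T^{1-2^{-i}}$ points. The rule $N_i = \lceil\sqrt{T N_{i-1}}\rceil$ with $N_0 = 1$ is an assumption made for illustration; the abstract does not state the schedule, and `batch_sizes` is a hypothetical helper name.

```python
import math


def batch_sizes(T):
    """Return batch sizes summing to T under the assumed rule
    N_i = ceil(sqrt(T * N_{i-1})), N_0 = 1 (illustrative, not from the abstract)."""
    sizes = []
    prev = 1   # N_0
    used = 0   # points consumed so far
    while used < T:
        n = math.ceil(math.sqrt(T * prev))
        n = min(n, T - used)  # truncate the final batch at the horizon
        sizes.append(n)
        used += n
        prev = n
    return sizes


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        sizes = batch_sizes(T)
        print(T, len(sizes), sizes)
```

Because the exponent gap to $T$ halves with every batch, the number of batches grows only doubly-logarithmically in the horizon under this assumed schedule.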

Keywords: Batched Kernelized Bandits, black-box optimization, regret bounds, adaptive batches, robust optimization, RKHS, minimax regret, noisy feedback

246. ❌ Maximizing Incremental Information Entropy for Contrastive Learning

Authors: Jiansong Zhang, Zhuoqin Yang, Xu Wu, Xiaoling Luo, Peizhong Liu, Linlin Shen Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

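Scoring rationale: The paper focuses on self-supervised representation learning with contrastive learning. It proposes IE-CL, a framework based on optimizing information entropy gain, and validates it on image datasets (CIFAR-10/100, STL-10, ImageNet). The content has no direct connection to any scoring keyword (all of which center on large models, deep learning principles, and AI applications in science): it does not cover large models, language models, model training techniques, reasoning methods, agent systems, or model optimization, and it is not applied to scientific fields such as bioinformatics or cheminformatics. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes IE-CL, a contrastive learning framework based on optimizing information entropy gain. It improves performance in small-batch settings by jointly optimizing a learnable transformation and an encoder regularizer, and it is validated on several image datasets.

Abstract translation

Contrastive learning has achieved remarkable success in self-supervised representation learning, with objectives usually guided by information-theoretic criteria such as mutual information maximization. Motivated by the limitations of static data augmentation and rigid invariance constraints, we propose IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly optimizes the entropy gain between augmented views while preserving semantic consistency. Our theoretical framework reframes the challenge by identifying the encoder as an information bottleneck and proposes jointly optimizing two components: a learnable transformation module that generates entropy and an encoder regularization module that preserves it. Experiments on CIFAR-10/100, STL-10, and ImageNet show that IE-CL consistently improves performance in small-batch training settings. In addition, our core modules can be integrated seamlessly into existing frameworks. This work connects theoretical principles with practice and offers a new perspective on contrastive learning.

Abstract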

Contrastive learning has achieved remarkable success in self-supervised representation learning, often guided by information-theoretic objectives such as mutual information maximization. Motivated by the limitations of static augmentations and rigid invariance constraints, we propose IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly optimizes the entropy gain between augmented views while preserving semantic consistency. Our theoretical framework reframes the challenge by identifying the encoder as an information bottleneck and proposes a joint optimization of two components: a learnable transformation for entropy generation and an encoder regularizer for its preservation. Experiments on CIFAR-10/100, STL-10, and ImageNet demonstrate that IE-CL consistently improves performance under small-batch settings. Moreover, our core modules can be seamlessly integrated into existing frameworks. This work bridges theoretical principles and practice, offering a new perspective in contrastive learning.

Keywords: Contrastive Learning, Information Entropy, Self-supervised Representation Learning, Incremental-Entropy Contrastive Learning, Encoder Regularizer, Small-batch Settings, Mutual Information Maximization, Semantic Consistency

247. ❌ Human-AI Collaborative Autonomous Experimentation With Proxy Modeling for Comparative Observation

Authors: Arpan Biswas, Hiroshi Funakubo, Yongtao Liu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12618v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

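Scoring rationale: The paper studies a human-AI collaborative autonomous experimentation framework (px-BO) that explores materials space using Bayesian optimization and proxy modeling. It mainly involves traditional machine learning methods such as Bayesian optimization, active learning, human-AI collaboration, and proxy modeling, and does not touch on large language models (LLMs), deep learning principles, or large-model techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (materials science); since this is the application setting rather than the core contribution, it receives 5 points. All other keywords concern large-model techniques, training methods, inference optimization, agent systems, and similar topics; they are unrelated and receive 0.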
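The arithmetic behind the verdict can be sketched as follows. The pass threshold uses the formula given in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The per-keyword rule `score = weight × relevance` is an assumption inferred from the table rows, where a weight of 0.0 yields a score of 0.0 even at relevance 5.0/10; the report does not state the rule explicitly, and the function names are hypothetical.

```python
def pass_threshold(total_weight):
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """Assumed rule (not stated in the report): weight times relevance (0-10)."""
    return weight * relevance


def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, one per keyword."""
    return sum(keyword_score(w, r) for w, r in rows)


if __name__ == "__main__":
    # 26 keywords at relevance 0 plus 'AI for Science' at relevance 5,
    # all with the weight 0.0 shown in the table above.
    rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
    threshold = pass_threshold(27.0)
    score = total_score(rows)
    print(round(threshold, 1), score, score >= threshold)
```

Under this assumed rule, a nonzero relevance cannot lift the total while the recorded weight is 0.0, which matches the 0.0 / 26.6 result shown for this paper.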

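!!! tip deepseek-chat TL;DR

The paper proposes px-BO, a human-AI collaborative Bayesian optimization framework based on proxy modeling. Using human preference votes and a proxy model that can stand in for them, it improves on traditional data-driven methods for materials space exploration and enables more efficient materials discovery.

Abstract translation

Tasks such as material characterization, synthesis, and the optimization of functional properties for target applications usually require a rapid strategic search over a multi-dimensional control parameter space, for example through active learning methods such as Bayesian optimization (BO). However, such high-dimensional experimental physical descriptors are often complex and noisy, and extracting low-dimensional scalar metrics or constructing objective functions from them is error-prone. Moreover, in traditional purely data-driven autonomous exploration, such objective functions often ignore subtle variations and key features of the physical descriptors and may therefore fail to discover unknown phenomena in the material system. To address this, the paper proposes proxy-modelled Bayesian optimization (px-BO), which optimizes through real-time teaming between human and AI agents. Within the BO loop, instead of defining a mathematical objective function directly from the experimental data, we introduce a real-time voting system: each new experimental result is compared with existing experiments, and human experts choose the preferred sample. These human-guided comparisons are then converted into a proxy-based objective function by fitting a Bradley-Terry (BT) model. To reduce human intervention, the iteratively trained proxy model can also act as an AI agent that votes in place of the human in later rounds. These surrogate votes are periodically validated by human experts, and the corrections are learned by the proxy model on the fly. We validate the proposed px-BO framework on simulated data and on BEPS data generated from a PTO sample. The results show that, compared with traditional data-driven exploration, the approach lets domain experts steer the search more effectively and improves exploration efficiency, which highlights the importance of human-AI teaming for accelerated and meaningful materials space exploration.

Abstract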

Optimizing tasks such as material characterization, synthesis, and functional properties for desired applications over multi-dimensional control parameters needs a rapid strategic search through active learning such as Bayesian optimization (BO). However, such high-dimensional experimental physical descriptors are complex and noisy, and deriving low-dimensional scalar metrics or objective functions from them can be erroneous. Moreover, in traditional purely data-driven autonomous exploration, such objective functions often ignore the subtle variations and key features of the physical descriptors and can thereby fail to discover unknown phenomena of the material systems. To address this, here we present a proxy-modelled Bayesian optimization (px-BO) via on-the-fly teaming between human and AI agents. Over the loop of BO, instead of defining a mathematical objective function directly from the experimental data, we introduce a voting system on the fly where the new experimental outcome is compared with existing experiments, and the human agents choose the preferred samples. These human-guided comparisons are then transformed into a proxy-based objective function by fitting a Bradley-Terry (BT) model. Then, to minimize human interaction, this iteratively trained proxy model also acts as an AI agent for future surrogate human votes. Finally, these surrogate votes are periodically validated by human agents, and the corrections are then learned by the proxy model on the fly. We demonstrate the performance of the proposed px-BO framework on simulated data and on BEPS data generated from a PTO sample. We find that our approach gives domain experts better control for an improved search over traditional data-driven exploration, which underscores the importance of human-AI teaming in accelerated and meaningful material space exploration.
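The step of turning pairwise human votes into a scalar objective can be illustrated with a plain Bradley-Terry fit. The sketch below uses the standard minorization-maximization update for Bradley-Terry strengths; it is not taken from the paper, the function names and the toy vote data are hypothetical, and it assumes every item wins at least one comparison so that no strength collapses to zero.

```python
def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    followed by normalization. Assumes each item has at least one win.
    """
    wins = [0] * n_items
    pair_counts = {}  # (i, j) with i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        p = [wins[i] / denom[i] if denom[i] > 0 else p[i] for i in range(n_items)]
        total = sum(p)
        p = [x / total for x in p]
    return p


def prefer_prob(p, i, j):
    """Model probability that item i is preferred over item j."""
    return p[i] / (p[i] + p[j])


if __name__ == "__main__":
    # Toy votes: each pair compared 4 times, the lower index winning 3 of them.
    votes = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
             + [(0, 2)] * 3 + [(2, 0)])
    strengths = fit_bradley_terry(3, votes)
    print(strengths, prefer_prob(strengths, 0, 2))
```

The fitted strengths (or their logarithms) could then serve as the proxy objective value for each experiment inside a BO loop, which is the role the abstract assigns to the BT model.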

Keywords: Human-AI collaboration, Bayesian optimization, Proxy modeling, Autonomous experimentation, Material science, Active learning, Bradley-Terry model, Comparative observation

248. ❌ Deferred is Better: A Framework for Multi-Granularity Deferred Interaction of Heterogeneous Features

Authors: Yi Xu, Moyu Zhang, Chaofan Fan, Jinxin Hu, Yu Zhang, Xiaoyi Zeng Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

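Scoring rationale: The paper studies feature interaction in click-through rate (CTR) prediction models and proposes a Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN) to handle feature heterogeneity. It focuses on conventional recommender systems, CTR prediction, and deep learning architecture optimization, and does not involve large language models (LLMs), large-model principles (such as MoE, scaling laws, pre-training, alignment, or inference optimization), or large-model applications in science. All scoring keywords relate directly to large models or associated techniques, and this paper's research area is entirely different, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the performance degradation caused by feature heterogeneity (differences in sparsity and information content) in CTR prediction. It proposes the Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN), which uses feature grouping and a hierarchical masking strategy to defer the introduction of low-information features, learning more robust feature representations and improving CTR prediction.

Abstract translation

Click-through rate (CTR) prediction models estimate the probability that a user clicks an item by modeling interactions across a vast feature space. A fundamental yet often overlooked challenge is the inherent heterogeneity of these features: their sparsity and information content vary enormously. For example, categorical features such as item IDs are extremely sparse, whereas numerical features such as item price are relatively dense. Mainstream CTR models largely ignore this heterogeneity and apply a uniform feature interaction strategy that feeds all features into the interaction layers at once. This is suboptimal, because introducing low-information features too early injects significant noise and masks the signal from information-rich features, which leads to model collapse and hinders the learning of robust representations. To address this challenge, we propose the Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN), which adaptively defers the introduction of features into the interaction process. MGDIN's core mechanism runs in two stages. First, a multi-granularity feature grouping strategy partitions the raw features into groups with more homogeneous information density at different granularities, which mitigates the effect of extreme individual feature sparsity and lets the model capture feature interactions from diverse perspectives. Second, a deferred interaction mechanism is implemented through a hierarchical masking strategy that controls when and how each group participates, masking low-information groups in the shallow layers and progressively unmasking them as the network deepens. This deferred introduction allows the model to build a robust understanding from high-information features before gradually incorporating sparser information from the other groups…

Abstract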

Click-through rate (CTR) prediction models estimate the probability of a user-item click by modeling interactions across a vast feature space. A fundamental yet often overlooked challenge is the inherent heterogeneity of these features: their sparsity and information content vary dramatically. For instance, categorical features like item IDs are extremely sparse, whereas numerical features like item price are relatively dense. Prevailing CTR models have largely ignored this heterogeneity, employing a uniform feature interaction strategy that inputs all features into the interaction layers simultaneously. This approach is suboptimal, as the premature introduction of low-information features can inject significant noise and mask the signals from information-rich features, which leads to model collapse and hinders the learning of robust representations. To address the above challenge, we propose a Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN), which adaptively defers the introduction of features into the feature interaction process. MGDIN’s core mechanism operates in two stages: First, it employs a multi-granularity feature grouping strategy to partition the raw features into distinct groups with more homogeneous information density in different granularities, thereby mitigating the effects of extreme individual feature sparsity and enabling the model to capture feature interactions from diverse perspectives. Second, a delayed interaction mechanism is implemented through a hierarchical masking strategy, which governs when and how each group participates by masking low-information groups in the early layers and progressively unmasking them as the network deepens. This deferred introduction allows the model to establish a robust understanding based on high-information features before gradually incorporating sparser information from other groups…
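The idea of masking low-information groups in early layers and unmasking them with depth can be sketched as a per-layer visibility table. Everything specific here is an assumption for illustration: the linear unmasking schedule, the convention that group 0 carries the most information, and the name `layer_masks` are not from the abstract, which does not describe how MGDIN actually schedules its masks.

```python
def layer_masks(num_layers, num_groups):
    """Return one 0/1 mask per layer over feature groups.

    Groups are assumed to be sorted from highest (index 0) to lowest
    information content. The number of visible groups grows linearly
    from 1 at the first layer to num_groups at the last layer.
    """
    masks = []
    for layer in range(num_layers):
        if num_layers == 1:
            visible = num_groups
        else:
            visible = 1 + (layer * (num_groups - 1)) // (num_layers - 1)
        masks.append([1 if g < visible else 0 for g in range(num_groups)])
    return masks


if __name__ == "__main__":
    for layer, mask in enumerate(layer_masks(4, 4)):
        print(layer, mask)
```

In a real network each mask would gate which group embeddings take part in that layer's interaction; the table form just makes the "deferred" ordering easy to inspect.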

Keywords: Click-through rate prediction, Feature heterogeneity, Multi-granularity feature grouping, Deferred interaction, Hierarchical masking, Sparse features, Feature interaction, Deep learning model

249. ❌ A Spectral Revisit of the Distributional Bellman Operator under the Cramér Metric

Authors: Keru Wang, Yixin Deng, Yao Lyu, Stephen Redmond, Shengbo Eben Li Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

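Scoring rationale: The paper studies the spectral analysis of the distributional Bellman operator under the Cramér metric in distributional reinforcement learning (DRL), a theoretical branch of reinforcement learning. All scoring keywords focus on large models (LLMs) and related techniques (training, alignment, reasoning, applications, and so on), whereas this paper does not involve large models, deep learning, or any specific technique named in the keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyzes the structure of the Bellman operator in distributional reinforcement learning at the CDF level and constructs regularized spectral Hilbert representations that exactly realize the CDF geometry, providing a foundation for further functional and operator-theoretic analysis in DRL.

Abstract translation

Distributional reinforcement learning (DRL) studies how full return distributions evolve under Bellman updates, rather than tracking only expected values. A classical result is that the distributional Bellman operator is a contraction under the Cramér metric, which corresponds to an $L^2$ geometry on differences of cumulative distribution functions (CDFs). Although this contraction guarantees stable policy evaluation, existing analyses stay largely at the metric level, focusing on contraction properties without explaining how the Bellman update acts structurally on distributions. This paper analyzes distributional Bellman dynamics directly at the CDF level, treating the Cramér geometry as the intrinsic analytical setting. At this level, the Bellman update acts affinely on CDFs and linearly on differences between CDFs, and its contraction property gives a uniform bound on this linear action. Building on this intrinsic formulation, we construct a family of regularized spectral Hilbert representations that realize the CDF-level geometry by exact conjugation, without changing the underlying Bellman dynamics. The regularization affects only the geometry and recovers the native Cramér metric as the regularization tends to zero. The framework clarifies the operator structure behind distributional Bellman updates and lays a foundation for further functional and operator-theoretic analysis in DRL.

Abstract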

Distributional reinforcement learning (DRL) studies the evolution of full return distributions under Bellman updates rather than focusing on expected values. A classical result is that the distributional Bellman operator is contractive under the Cramér metric, which corresponds to an $L^2$ geometry on differences of cumulative distribution functions (CDFs). While this contraction ensures stability of policy evaluation, existing analyses remain largely metric, focusing on contraction properties without elucidating the structural action of the Bellman update on distributions. In this work, we analyse distributional Bellman dynamics directly at the level of CDFs, treating the Cramér geometry as the intrinsic analytical setting. At this level, the Bellman update acts affinely on CDFs and linearly on differences between CDFs, and its contraction property yields a uniform bound on this linear action. Building on this intrinsic formulation, we construct a family of regularised spectral Hilbert representations that realise the CDF-level geometry by exact conjugation, without modifying the underlying Bellman dynamics. The regularisation affects only the geometry and vanishes in the zero-regularisation limit, recovering the native Cramér metric. This framework clarifies the operator structure underlying distributional Bellman updates and provides a foundation for further functional and operator-theoretic analyses in DRL.
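The "$L^2$ geometry on differences of CDFs" can be made concrete for finitely supported return distributions. The sketch below computes the Cramér distance exactly for two discrete distributions and applies the shift-and-scale map $z \mapsto r + \gamma z$ of a single deterministic transition, under which the distance shrinks by a factor $\sqrt{\gamma}$. This is only an illustration of the metric and of the scaling step; the function names are hypothetical, the example assumes $\gamma > 0$, and it is not the paper's spectral construction.

```python
import math


def cramer_distance(p, q):
    """Cramér (l2) distance between two discrete distributions.

    p, q: dicts mapping atom -> probability. The CDF difference is
    piecewise constant, so the integral of its square is a finite sum.
    """
    atoms = sorted(set(p) | set(q))
    fp = fq = 0.0
    total = 0.0
    for left, right in zip(atoms, atoms[1:]):
        fp += p.get(left, 0.0)
        fq += q.get(left, 0.0)
        total += (fp - fq) ** 2 * (right - left)
    return math.sqrt(total)


def bellman_pushforward(dist, reward, gamma):
    """Push a return distribution through z -> reward + gamma * z (gamma > 0)."""
    return {reward + gamma * z: w for z, w in dist.items()}


if __name__ == "__main__":
    p = {0.0: 0.5, 1.0: 0.5}
    q = {0.0: 0.2, 2.0: 0.8}
    gamma, reward = 0.9, 1.0
    before = cramer_distance(p, q)
    after = cramer_distance(bellman_pushforward(p, reward, gamma),
                            bellman_pushforward(q, reward, gamma))
    print(before, after, after / before, math.sqrt(gamma))
```

The shift by the reward leaves the distance unchanged and the scaling by $\gamma$ multiplies the squared distance by $\gamma$, which is the source of the contraction in this simple case.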

Keywords: Distributional Reinforcement Learning, Bellman Operator, Cramér Metric, Cumulative Distribution Functions, Spectral Hilbert Representations, Operator Theory, Policy Evaluation, Contraction Property

250. ❌ Accelerating materials discovery using foundation model based In-context active learning

Authors: Jeffrey Hu, Rongzhi Dong, Ying Feng, Ming Hu, Jianjun Hu Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

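Scoring rationale: The core of the paper is using TabPFN, a transformer-based pre-trained foundation model, to accelerate active learning in materials science. The keywords with high relevance are: 1) 'Foundation Models' (the paper explicitly uses TabPFN as a foundation model); 2) 'Pre-training' (TabPFN is pre-trained on millions of synthetic tasks); 3) 'In-context Learning' (the method is named 'In-Context Active Learning' and relies on the foundation model's in-context learning ability); 4) 'AI for Science' (the application is materials discovery, which falls under AI for science). Other keywords such as MoE, SFT, RAG, and RLHF are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes an in-context active learning method based on a foundation model (TabPFN) to accelerate materials discovery. Compared with conventional Gaussian Process and Random Forest surrogates, it performs best on 8 of 10 materials datasets, with mean savings in extra experiments/evaluations of 52% relative to GP and 29.77% relative to RF.

Abstract translation

Active learning (AL) is a powerful paradigm that accelerates materials discovery by iteratively steering experiments toward the most promising candidates, reducing costly synthesis-and-characterization cycles. However, current AL relies mainly on Gaussian Process (GP) and Random Forest (RF) surrogates, which have complementary limitations: GP underfits complex composition-property relationships because of rigid kernel assumptions, while RF gives unreliable uncertainty estimates in small-data regimes, which is exactly where most materials datasets sit (typically fewer than 500 samples). This paper proposes foundation-model-based In-Context Active Learning (ICAL), which replaces the conventional surrogates with TabPFN. TabPFN is a transformer-based foundation model pre-trained on millions of synthetic tasks to meta-learn a universal prior over tabular data. It performs principled Bayesian inference in a single forward pass, needs no dataset-specific retraining, and provides well-calibrated predictive uncertainty where GP and RF perform worst. Benchmarked against GP and RF on 10 materials datasets covering copper alloy hardness and electrical conductivity, bulk metallic glass-forming ability, and crystal lattice thermal conductivity, TabPFN performs best on 8 of them, with a mean saving in extra experiments/evaluations of 52% relative to GP and 29.77% relative to RF. Cross-validation analysis confirms that TabPFN's advantage comes from better uncertainty calibration: it achieves the lowest Negative Log-Likelihood and Area Under the Sparsification Error curve among all surrogates. The study shows that a pre-trained foundation model can serve as a highly effective surrogate for accelerating active-learning-based materials discovery.

Abstract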

Active learning (AL) has emerged as a powerful paradigm for accelerating materials discovery by iteratively steering experiments toward the most promising candidates, reducing costly synthesis-and-characterization cycles. However, current AL relies predominantly on Gaussian Process (GP) and Random Forest (RF) surrogates with complementary limitations: GP underfits complex composition–property landscapes due to rigid kernel assumptions, while RF produces unreliable uncertainty estimates in small-data regimes, precisely where most materials datasets reside (with < 500 samples). Here we propose foundation-model-based In-Context Active Learning (ICAL), replacing conventional surrogates with TabPFN, a transformer-based foundation model pre-trained on millions of synthetic tasks to meta-learn a universal prior over tabular data. TabPFN performs principled Bayesian inference in a single forward pass without dataset-specific retraining, delivering well-calibrated predictive uncertainty where GP and RF fail most severely. Benchmarked against GP and RF across 10 materials datasets spanning copper alloy hardness and electrical conductivity, bulk metallic glass-forming ability, and crystal lattice thermal conductivity, TabPFN wins on 8 out of 10 datasets, achieving a mean saving of 52% in extra experiments/evaluations relative to GP and 29.77% relative to RF. Cross-validation analysis confirms that TabPFN’s advantage stems from superior uncertainty calibration, achieving the lowest Negative Log-Likelihood and Area Under the Sparsification Error curve among all surrogates. Our work demonstrates that a pre-trained foundation model can serve as a highly effective surrogate for accelerating active learning-based materials discovery.
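Negative Log-Likelihood is one of the two calibration measures named in the abstract. As a rough illustration of why it rewards calibrated uncertainty, the sketch below scores the same point predictions under two different predictive standard deviations. It assumes Gaussian predictive distributions, which the abstract does not specify; the data are made up for the example, `gaussian_nll` is a hypothetical helper, and no TabPFN API is used.

```python
import math


def gaussian_nll(y_true, mean, std):
    """Average negative log-likelihood of targets under N(mean_i, std_i^2)."""
    total = 0.0
    for y, m, s in zip(y_true, mean, std):
        total += 0.5 * math.log(2 * math.pi * s * s) + (y - m) ** 2 / (2 * s * s)
    return total / len(y_true)


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0]
    mean = [1.5, 1.5, 3.5]           # every prediction is off by 0.5
    calibrated = [0.5, 0.5, 0.5]     # stated uncertainty matches the error
    overconfident = [0.1, 0.1, 0.1]  # stated uncertainty is far too small
    print(gaussian_nll(y, mean, calibrated))
    print(gaussian_nll(y, mean, overconfident))
```

With identical point predictions, the overconfident surrogate is penalized heavily, which is the sense in which a lower NLL indicates uncertainty estimates that an acquisition function can trust.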

Keywords: foundation model, active learning, materials discovery, TabPFN, in-context learning, uncertainty calibration, transformer, pre-trained model

251. ❌ Variational Garrote for Sparse Inverse Problems

Authors: Kanghun Lee, Hyungjoon Soh, Junghyo Jo Journal/Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

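Scoring rationale: The paper studies sparse regularization in inverse problems, in particular comparing L1 regularization with the Variational Garrote (a probabilistic method that approximates L0 sparsity). It is unrelated to most keywords, which mainly concern large language models, training techniques, inference optimization, alignment, agent systems, and similar topics. The only relevant keyword is 'Mixture of Experts OR MoE OR Sparse Models', because the paper involves sparse models, but the relevance is limited (5 points) since the paper focuses on conventional sparse regularization rather than MoE or sparse activation in large models. Other keywords such as AI for Science may be indirectly related, but the paper does not explicitly address bioinformatics or cheminformatics applications, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the effect of sparse regularization in inverse problems. Comparing L1 regularization with the Variational Garrote, it finds that the latter achieves lower minimum generalization error and better stability in strongly underdetermined settings.

Abstract translation

Sparse regularization plays a central role in solving inverse problems that arise from incomplete or corrupted measurements. Different regularizers correspond to different prior assumptions about the structure of the unknown signal, and reconstruction performance depends on how well these priors match the intrinsic sparsity of the data. This study examines the effect of sparsity priors in inverse problems by comparing conventional L1 regularization with the Variational Garrote (VG), a probabilistic method that approximates L0 sparsity through variational binary gating variables. A unified experimental framework is built across several reconstruction tasks, including signal resampling, signal denoising, and sparse-view computed tomography. To allow consistent comparison between models with different parameterizations, the regularization strength is swept over wide ranges and reconstruction behavior is analyzed through train-generalization error curves. The experiments reveal characteristic bias-variance tradeoff patterns across tasks and show that VG frequently achieves lower minimum generalization error and better stability in strongly underdetermined regimes where accurate support recovery is critical. These results suggest that sparsity priors closer to a spike-and-slab structure are advantageous when the underlying coefficient distribution is strongly sparse. The study highlights the importance of prior-data alignment in sparse inverse problems and offers empirical insight into how variational L0-type methods behave under different information bottlenecks.

Abstract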

Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers correspond to different prior assumptions about the structure of the unknown signal, and reconstruction performance depends on how well these priors match the intrinsic sparsity of the data. This work investigates the effect of sparsity priors in inverse problems by comparing conventional L1 regularization with the Variational Garrote (VG), a probabilistic method that approximates L0 sparsity through variational binary gating variables. A unified experimental framework is constructed across multiple reconstruction tasks including signal resampling, signal denoising, and sparse-view computed tomography. To enable consistent comparison across models with different parameterizations, regularization strength is swept across wide ranges and reconstruction behavior is analyzed through train-generalization error curves. Experiments reveal characteristic bias-variance tradeoff patterns across tasks and demonstrate that VG frequently achieves lower minimum generalization error and improved stability in strongly underdetermined regimes where accurate support recovery is critical. These results suggest that sparsity priors closer to spike-and-slab structure can provide advantages when the underlying coefficient distribution is strongly sparse. The study highlights the importance of prior-data alignment in sparse inverse problems and provides empirical insights into the behavior of variational L0-type methods across different information bottlenecks.
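The difference between an L1 prior and an L0-type prior shows up already in the scalar denoising case. The sketch below contrasts soft thresholding, which is the proximal operator of the L1 penalty, with hard thresholding, which is the corresponding rule for an L0 penalty. Hard thresholding is used here only as a simple stand-in for L0-like behavior; it is not the Variational Garrote, and the coefficient values are invented for the example.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |z|: shrinks every surviving value by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def hard_threshold(x, lam):
    """L0-style rule: keep values above the threshold unchanged, zero the rest."""
    return x if abs(x) > lam else 0.0


if __name__ == "__main__":
    coeffs = [3.0, -0.2, 0.05, -2.5]
    lam = 0.5
    print([soft_threshold(c, lam) for c in coeffs])
    print([hard_threshold(c, lam) for c in coeffs])
```

Both rules recover the same support, but the L1 rule biases the large coefficients toward zero while the L0-style rule leaves them intact, which is one way to read the abstract's point about priors closer to spike-and-slab structure.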

Keywords: sparse regularization, inverse problems, Variational Garrote, L1 regularization, generalization error, support recovery, bias-variance tradeoff, sparsity priors
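
As a rough illustration of the L1 baseline mentioned in the abstract (not the paper's code, and not the Variational Garrote), the sketch below runs ISTA, a standard proximal-gradient solver for L1-regularized least squares, on a small underdetermined problem. The function names `soft_threshold` and `ista`, the problem sizes, and the value of `lam` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with ISTA."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient
    # (the squared spectral norm of A).
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                    # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)   # gradient step, then shrinkage
    return x

# Hypothetical toy problem: 30 measurements, 100 unknowns, 3 nonzero coefficients.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100))
x_true = np.zeros(100)
x_true[[3, 40, 77]] = [2.0, -1.5, 1.0]
y = A @ x_true

x_hat = ista(A, y, lam=0.1)
print("nonzero entries in estimate:", int(np.sum(np.abs(x_hat) > 1e-3)))
```

Sweeping `lam` over a wide range and recording the error on held-out measurements would give the kind of train-generalization curve the abstract describes, under the assumptions above.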

252. ❌ Lyapunov Stable Graph Neural Flow

Authors: Haoyu Chu, Xiaotong Chen, Wei Zhou, Wenjun Cui, Kai Zhao, Shikui Wei, Qiyu Kang Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
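
The 26.6 threshold in the score line follows from the report's configured passing-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0). A minimal sketch of that check is shown below; the function names are hypothetical, and treating a score exactly equal to the threshold as a pass is an assumption, since the report does not state it.

```python
def passing_score(total_weight, base=5.0, slope=0.8):
    """Passing threshold as configured in the report: base + slope * total weight."""
    return base + slope * total_weight

def passes(score, total_weight):
    """Assumption: a paper passes when its score reaches the threshold (>=)."""
    return score >= passing_score(total_weight)

threshold = passing_score(27.0)
print(round(threshold, 1))   # 26.6
print(passes(0.0, 27.0))     # False
```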

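Scoring rationale: The paper studies the adversarial robustness of graph neural networks (GNNs) and proposes a defense framework based on Lyapunov stability. All scoring keywords concern large models, deep learning techniques, or scientific applications, but the paper focuses on GNNs, a specific subfield of deep learning, and does not touch on large language models, MoE, scaling laws, training and alignment, inference optimization, agents, quantization and compression, hallucination mitigation, interpretability, model merging, in-context learning, or AI-for-science applications. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of graph neural networks to adversarial attacks, the paper proposes a new defense framework based on Lyapunov stability. By constraining the feature-update dynamics and introducing a learnable Lyapunov function, it provides theoretical stability guarantees and substantially improves adversarial robustness in experiments.

Abstract translation

Graph Neural Networks (GNNs) are highly sensitive to adversarial perturbations of both topology and node features, which makes learning robust representations a key challenge. This work combines GNNs with control theory and proposes a new defense framework based on integer- and fractional-order Lyapunov stability. Unlike conventional strategies that rely on resource-intensive adversarial training or data purification, the method constrains the underlying feature-update dynamics of the GNN itself. The authors design an adaptive, learnable Lyapunov function together with a novel projection mechanism that maps the network state into a stable space, giving theoretically provable stability guarantees. The mechanism is orthogonal to existing defenses and can be combined with techniques such as adversarial training for cumulative robustness. Extensive experiments show that Lyapunov-stable graph neural flows substantially outperform base neural flows and current state-of-the-art baselines on standard benchmarks and under a variety of adversarial attack scenarios.

Abstract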

Graph Neural Networks (GNNs) are highly vulnerable to adversarial perturbations in both topology and features, making the learning of robust representations a critical challenge. In this work, we bridge GNNs with control theory to introduce a novel defense framework grounded in integer- and fractional-order Lyapunov stability. Unlike conventional strategies that rely on resource-heavy adversarial training or data purification, our approach fundamentally constrains the underlying feature-update dynamics of the GNN. We propose an adaptive, learnable Lyapunov function paired with a novel projection mechanism that maps the network’s state into a stable space, thereby offering theoretically provable stability guarantees. Notably, this mechanism is orthogonal to existing defenses, allowing for seamless integration with techniques like adversarial training to achieve cumulative robustness. Extensive experiments demonstrate that our Lyapunov-stable graph neural flows substantially outperform base neural flows and state-of-the-art baselines across standard benchmarks and various adversarial attack scenarios.

Keywords: Graph Neural Networks, Adversarial Robustness, Lyapunov Stability, Control Theory, Feature Update Dynamics, Theoretical Guarantees, Defense Framework

253. ❌ Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE

Authors: Faris Chaudhry Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

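Scoring rationale: The paper is a theoretical analysis of the temperature parameter in the InfoNCE loss for contrastive learning, using Langevin dynamics and simulated annealing; it belongs to machine learning theory. All keywords relate directly to large models, innovations in deep learning techniques, or scientific applications of AI, whereas this paper focuses on the theory of contrastive learning and covers none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Using Langevin dynamics and simulated annealing theory, the paper analyzes the dynamics of the InfoNCE temperature parameter in contrastive learning under fixed and annealed schedules. It proves that a slow logarithmic inverse-temperature schedule ensures convergence to globally optimal representations, while faster schedules may become trapped in local optima.

Abstract translation

The InfoNCE loss in contrastive learning depends heavily on a temperature parameter, yet its dynamics under fixed versus annealed schedules are still poorly understood. This paper gives a theoretical analysis by modeling the evolution of embeddings under Langevin dynamics on a compact Riemannian manifold. Under mild smoothness and energy-barrier assumptions, it proves that classical simulated annealing guarantees extend to this setting: a slow logarithmic inverse-temperature schedule ensures convergence in probability to the set of globally optimal representations, while faster schedules may become trapped in local minima. The work links contrastive learning to simulated annealing and provides a theoretical basis for understanding and tuning temperature schedules.

Abstract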

The InfoNCE loss in contrastive learning depends critically on a temperature parameter, yet its dynamics under fixed versus annealed schedules remain poorly understood. We provide a theoretical analysis by modeling embedding evolution under Langevin dynamics on a compact Riemannian manifold. Under mild smoothness and energy-barrier assumptions, we show that classical simulated annealing guarantees extend to this setting: slow logarithmic inverse-temperature schedules ensure convergence in probability to a set of globally optimal representations, while faster schedules risk becoming trapped in suboptimal minima. Our results establish a link between contrastive learning and simulated annealing, providing a principled basis for understanding and tuning temperature schedules.

Keywords: InfoNCE loss, contrastive learning, temperature parameter, Langevin dynamics, simulated annealing, embedding evolution, Riemannian manifold, convergence guarantees
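
To make the two objects in this abstract concrete, the sketch below computes a standard InfoNCE loss written in terms of the inverse temperature `beta` (that is, 1/temperature) and a logarithmic inverse-temperature schedule of the classical simulated-annealing form `log(1 + t) / c`. The function names, the constant `c`, and the exact schedule form are assumptions made for illustration; the abstract does not give the paper's constants or its Langevin update.

```python
import numpy as np

def info_nce(z_a, z_b, beta):
    """InfoNCE loss for paired embeddings; row i of z_a is the positive of row i of z_b.

    beta is the inverse temperature (1 / temperature).
    """
    logits = beta * (z_a @ z_b.T)                          # scaled similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)    # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # positives sit on the diagonal

def log_inverse_temperature(t, c=1.0):
    """Slow logarithmic schedule beta_t = log(1 + t) / c (c is a hypothetical constant)."""
    return np.log1p(t) / c

# Toy check with four mutually orthogonal unit embeddings.
z = np.eye(4)
for step in (0, 10, 1000):
    beta = log_inverse_temperature(step)
    print(step, float(info_nce(z, z, beta)))
```

At `beta = 0` the loss equals `log(n)` for `n` samples, because every candidate is equally likely; the loss falls as `beta` grows for these perfectly aligned toy embeddings.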

254. ❌ Deep Distance Measurement Method for Unsupervised Multivariate Time Series Similarity Retrieval

Authors: Susumu Naito, Kouta Nakata, Yasunori Taguchi Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

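Scoring rationale: The paper focuses on a deep distance measurement method for unsupervised multivariate time series similarity retrieval, which falls under time series analysis and industrial applications. It does not involve large language models (LLMs), innovations in deep learning techniques, or applications of large models in other domains. The only point of contact is the “AI for Science” keyword: the paper applies deep learning to time series data from an industrial plant (a pulp-and-paper mill), which can be viewed as AI applied to industrial science and engineering, but this is not its core content, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Deep Distance Measurement Method (DDMM) to improve the accuracy of unsupervised multivariate time series similarity retrieval, and shows on an industrial plant dataset that it outperforms existing methods.

Abstract translation

This paper proposes the Deep Distance Measurement Method (DDMM) to improve accuracy in unsupervised multivariate time series similarity retrieval. DDMM learns minute differences within states across the entire time series and can therefore recognize minute differences between states, which matter to users in industrial plants. To do so, DDMM uses a learning algorithm that samples anchor and positive sample pairs arbitrarily from the entire time series, assigns each pair a weight based on the Euclidean distance within the pair, and learns the within-pair differences weighted accordingly. The algorithm can both learn minute differences within states and sample pairs from the whole series. Empirical studies show that DDMM significantly outperforms state-of-the-art time series representation learning methods on the pulp-and-paper mill dataset, demonstrating its effectiveness in industrial settings. Experiments with a combined model further show that linking DDMM with existing feature extraction methods improves retrieval accuracy.

Abstract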

We propose the Deep Distance Measurement Method (DDMM) to improve retrieval accuracy in unsupervised multivariate time series similarity retrieval. DDMM enables learning of minute differences within states in the entire time series and thereby recognition of minute differences between states, which are of interest to users in industrial plants. To achieve this, DDMM uses a learning algorithm that assigns a weight to each pair of an anchor and a positive sample, arbitrarily sampled from the entire time series, based on the Euclidean distance within the pair and learns the differences within the pairs weighted by the weights. This algorithm allows both learning minute differences within states and sampling pairs from the entire time series. Our empirical studies showed that DDMM significantly outperformed state-of-the-art time series representation learning methods on the Pulp-and-paper mill dataset and demonstrated the effectiveness of DDMM in industrial plants. Furthermore, we showed that accuracy can be further improved by linking DDMM with existing feature extraction methods through experiments with the combined model.

Keywords: multivariate time series, similarity retrieval, unsupervised learning, deep distance measurement, industrial plants, representation learning, feature extraction, pulp-and-paper mill

255. ❌ As Language Models Scale, Low-order Linear Depth Dynamics Emerge

Authors: Buddhika Nettasinghe, Geethu Joseph Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

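Scoring rationale: The paper's core subject is the depth dynamics of large language models (LLMs): it finds that as model size grows, layer-to-layer dynamics show low-order linear behavior, which is highly relevant to “Large Language Models” and “Mechanistic Interpretability” (10 points each). It also examines how model size relates to the accuracy of the linear surrogate, which has some connection to “Scaling Laws” (5 points). Other keywords such as MoE, SFT, RAG, and Agents are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The study finds that as language models scale, transformer depth dynamics within a context show low-order linear behavior, and that linear-surrogate accuracy improves monotonically with model size, providing a systems-theoretic foundation for analyzing and controlling large language models.

Abstract translation

Large language models are often regarded as high-dimensional nonlinear systems and treated as black boxes. This paper shows that, within a given context, transformer depth dynamics can be captured accurately by low-order linear surrogate models. Across tasks including toxicity, irony, hate speech, and sentiment analysis, a 32-dimensional linear surrogate reproduces the layerwise sensitivity profile of GPT-2-large with near-perfect agreement, capturing how the final output changes under additive perturbations injected at each layer. The authors then find a striking scaling law: for a linear surrogate of fixed order, agreement with the full model improves monotonically with model size across the GPT-2 family. The linear surrogate also enables principled multi-layer intervention strategies that, when applied to the full model, require less energy than standard heuristic schedules. Taken together, the results show that low-order linear depth dynamics emerge within contexts as language models scale, providing a systems-theoretic foundation for analyzing and controlling them.

Abstract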

Large language models are often viewed as high-dimensional nonlinear systems and treated as black boxes. Here, we show that transformer depth dynamics admit accurate low-order linear surrogates within context. Across tasks including toxicity, irony, hate speech and sentiment, a 32-dimensional linear surrogate reproduces the layerwise sensitivity profile of GPT-2-large with near-perfect agreement, capturing how the final output shifts under additive injections at each layer. We then uncover a surprising scaling principle: for a fixed-order linear surrogate, agreement with the full model improves monotonically with model size across the GPT-2 family. This linear surrogate also enables principled multi-layer interventions that require less energy than standard heuristic schedules when applied to the full model. Together, our results reveal that as language models scale, low-order linear depth dynamics emerge within contexts, offering a systems-theoretic foundation for analyzing and controlling them.

Keywords: Large Language Models, Transformer Depth Dynamics, Linear Surrogates, Scaling Principle, Model Size, GPT-2, Mechanistic Interpretability, Systems Theory

256. ❌ A Reduction Algorithm for Markovian Contextual Linear Bandits

Authors: Kaan Buyukkalayci, Osama Hanna, Christina Fragouli Source: arxiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

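Scoring rationale: The paper studies algorithms for Markovian contextual linear bandits, which belongs to reinforcement learning and online learning, with a focus on theoretical algorithm design and regret-bound analysis. All scoring keywords relate to large models, deep learning techniques, or scientific applications of AI, none of which this paper covers. Its content is an extension of a classical machine learning algorithm (linear bandits) to a particular contextual setting, and it neither uses nor discusses large-model techniques, training methods, inference optimization, alignment techniques, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

For Markovian contextual linear bandits, where the action set evolves according to an exogenous Markov chain, the paper proposes a reduction algorithm based on a stationary surrogate action set and obtains high-probability worst-case regret bounds matching those of the underlying linear bandit oracle, both when the transition distribution is known and when it is unknown.

Abstract translation

Recent work shows that when contexts are independent and identically distributed, linear contextual bandits can be reduced to single-context linear bandits. This “contexts are cheap” perspective has clear advantages: it supports sharper finite-time analysis and can draw on mature techniques from the linear bandit literature, such as methods for handling model misspecification and adversarial data corruption. Motivated by applications with temporally correlated availability, the authors extend this perspective to Markovian contextual linear bandits, in which the action set evolves via an exogenous Markov chain. The core contribution is a reduction under uniform geometric ergodicity. By constructing a stationary surrogate action set, they solve the problem with a standard linear bandit oracle and use a delayed-update scheme to control the bias caused by nonstationary conditional context distributions. For unknown transition distributions, they further propose a phased algorithm that learns the surrogate mapping online. In both settings they obtain high-probability worst-case regret bounds that match those of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.

Abstract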

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This “contexts are cheap” perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.

Keywords: Markovian contextual bandits, linear bandits, reduction algorithm, regret bound, geometric ergodicity, surrogate action set, delayed-update scheme, phased algorithm

257. ❌ Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

Authors: Raphiel J. Murden, Ganzhong Tian, Deqiang Qiu, Benjamin B. Risk Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

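Scoring rationale: The paper proposes ProJIVE, a probabilistic model for multimodal data integration, in the area of statistics and bioinformatics. Its content is unrelated to most keywords (large-model techniques, training methods, inference optimization, and so on) and has some connection only to “AI for Science OR Bioinformatics OR Cheminformatics”: it is applied to brain imaging and cognitive data in Alzheimer’s disease, which is a bioinformatics or scientific-AI application but not core large-model or deep learning technology, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes ProJIVE, a probabilistic model for multimodal data integration that estimates joint and individual variation components with an expectation-maximization algorithm. Applied to brain morphometry and cognition data in Alzheimer’s disease, it finds biologically meaningful patterns of variation related to existing biomarkers.

Abstract translation

In modern scientific applications such as genomics, metabolomics, and neuroimaging, it is common to collect several types of data on the same set of subjects. Joint and Individual Variance Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more feature sets measured on common subjects and separates it from the variation unique to each feature set. This study develops an expectation-maximization (EM) algorithm to estimate a probabilistic model under the JIVE framework. The model extends probabilistic principal component analysis to multiple data sets. The maximum likelihood approach estimates the joint and individual components simultaneously, which may yield higher accuracy than other methods. The authors apply probabilistic JIVE (ProJIVE) to brain morphometry and cognitive measures in Alzheimer’s disease. ProJIVE identifies biologically meaningful trajectories of variation, and its joint morphometry and cognition subject scores correlate strongly with existing, more expensive biomarkers. The data used in the analysis come from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on the authors’ GitHub page.

Abstract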

Collecting multiple types of data on the same set of subjects is common in modern scientific applications including genomics, metabolomics, and neuroimaging. Joint and Individual Variance Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more sets of features captured on common subjects and isolates this variation from that unique to each set of features. We develop an expectation-maximization (EM) algorithm to estimate a probabilistic model for the JIVE framework. The model extends probabilistic principal components analysis to multiple data sets. Our maximum likelihood approach simultaneously estimates joint and individual components, which can lead to greater accuracy compared to other methods. We apply ProJIVE to measures of brain morphometry and cognition in Alzheimer’s disease. ProJIVE learns biologically meaningful courses of variation, and the joint morphometry and cognition subject scores are strongly related to more expensive existing biomarkers. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on our GitHub page.

Keywords: data integration, probabilistic model, joint variation, individual variation, expectation-maximization, Alzheimer’s disease, brain morphometry, cognition

258. ❌ Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Authors: Abhinaba Basu, Pavan Chakraborty Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

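Scoring rationale: The core of the paper is a budget-sensitive framework (BSDS/DQS) for evaluating AI-guided selection strategies in scientific discovery, with drug discovery as a case study assessing whether LLMs add marginal value to an existing ML pipeline. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (weight 1.0), because the abstract explicitly refers to LLMs and evaluates 28 LLM configurations (zero-shot and few-shot), a central part of the study. It is also highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (weight 1.0), because the work is applied to drug discovery (MoleculeNet HIV, Tox21, and other bioinformatics and cheminformatics settings) and falls under AI for Science. Other keywords (such as MoE, Scaling Laws, and RLHF) are not mentioned in the abstract or are unrelated to the paper, and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a budget-sensitive discovery scoring framework (BSDS/DQS) for evaluating AI-guided scientific selection strategies and, using drug discovery as an example, finds that LLMs add no value beyond an existing ML classifier.

Abstract translation

Scientific discovery increasingly relies on AI systems to select candidates for costly experimental validation, yet there is no principled, budget-aware evaluation framework for comparing selection strategies. The gap is widened by large language models (LLMs), which can generate plausible-looking scientific proposals but lack reliable downstream evaluation. The authors propose the Budget-Sensitive Discovery Score (BSDS), a formally verified metric (20 theorems machine-checked by the Lean 4 proof assistant) that, at each budget level, jointly penalizes false discoveries (λ-weighted false discovery rate) and excessive abstention (γ-weighted coverage gap). Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at one particular budget.

As a case study, BSDS/DQS is applied to the question of whether LLMs add marginal value to an existing machine learning pipeline for candidate selection in drug discovery. On the MoleculeNet HIV dataset (41,127 compounds, 3.5% active, 1,000 bootstrap replicates), the authors evaluate 39 proposers using SMILES molecular representations (11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations) under both random and scaffold splits. Three findings emerge. First, a simple random-forest-based greedy ML proposer (Greedy-ML) achieves the best DQS (-0.046), outperforming all multilayer perceptron variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, confirming that LLMs provide no marginal value over an existing trained classifier. Third, the proposer ranking generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug autonomous-driving safety domain, and a 9×7 grid of penalty parameters (τ ≥ 0.636, mean τ = 0.863). The framework applies to any setting in which candidates are selected under budget constraints and asymmetric error costs.

Abstract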

Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies – a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric – 20 theorems machine-checked by the Lean 4 proof assistant – that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers – 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations – using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.

Keywords: Budget-Sensitive Discovery Score, LLM evaluation, drug discovery, AI for science, formal verification, candidate selection, MoleculeNet, HIV dataset

259. ❌ Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities

Authors: Neil K. R. Sehgal, Jena Shaw Tronieri, Lyle Ungar, Sharath Chandra Guntuku Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12341v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

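Scoring rationale: The paper studies side-effect reports on social media for GLP-1 receptor agonists (such as semaglutide and tirzepatide), which belongs to drug safety monitoring and public health. It does not involve large models, deep learning techniques, or novel methods, and uses only conventional text analysis on social media data. The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the study is a bioinformatics or medical informatics application; however, the paper does not explicitly use AI techniques and involves only basic data analysis, so it receives 5 points (some relevance). All other keywords concern large-model techniques, training methods, inference optimization, agent systems, and similar topics, and are scored 0.

!!! tip deepseek-chat TL;DR

By analyzing 410,198 Reddit posts, the study characterizes the side effects that patients self-report after using semaglutide and tirzepatide. Gastrointestinal symptoms are the most common, and the analysis identifies potential reproductive and temperature-related side effects that are not well documented in clinical trials.

Abstract translation

Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that go beyond clinical trial data. The authors analyzed 410,198 Reddit posts (May 2019 to June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms were the most prevalent, including nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (such as menstrual irregularities) and temperature-related complaints (such as chills and hot flashes) emerged as unrecognized potential effects. These findings highlight patient concerns that current drug labeling and trials do not capture well. Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and broadening understanding of the real-world safety profile of GLP-1 RAs.

Abstract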

Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that extend beyond clinical trial data. We analyzed 410,198 Reddit posts (May 2019-June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms predominated, including nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (e.g., menstrual irregularities) and temperature-related complaints (e.g., chills, hot flashes) emerged as unrecognized potential effects. These findings highlight patient concerns not well captured in current labeling or trials. Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and expanding understanding of the real-world safety profile of GLP-1 RAs.

Keywords: semaglutide, tirzepatide, side effects, social media analysis, GLP-1 receptor agonists, pharmacovigilance, patient-reported outcomes, Reddit

260. ❌ SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules

Authors: Guy Shapira, Yoel Shkolnisky Journal/source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

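Scoring rationale: This paper focuses on SHREC, a computational algorithm for reconstructing helical molecular structures in cryo-electron microscopy (cryo-EM), a specific application in computational/structural biology. It does not involve large models, deep learning principles, or related training/inference methods. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work belongs to bioinformatics/computational biology; however, the paper itself does not emphasize AI or machine learning (it uses traditional computational methods such as spectral embedding), so it is given 5 points (some relevance). All other keywords are unrelated to large models or deep learning and score 0.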
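
The table above lists a relevance of 5.0/10 for the last keyword, yet the paper's total is 0.0. A minimal sketch of how such a total could arise, assuming (the report does not state this) that each row's score is weight × relevance. The pass mark uses the formula from the report's configuration section, 5.0 + 0.8 × total weight, with a total weight of 27.0. The function names are hypothetical.

```python
# Sketch only: the per-row rule (weight * relevance) is an assumption inferred
# from the rows shown in this report, not a documented formula.

def row_score(weight: float, relevance: float) -> float:
    """Score contributed by one keyword row (assumed rule)."""
    return weight * relevance

def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's configuration: base + factor * total weight."""
    return base + factor * total_weight

# (weight, relevance) pairs as they appear in the 27-row table above:
# 26 rows with relevance 0.0 and one row with relevance 5.0, all with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]

total = sum(row_score(w, r) for w, r in rows)
threshold = passing_score(27.0)
passed = total >= threshold

print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumption, a zero weight cancels any nonzero relevance, which would explain why a 5.0/10 relevance still yields a row score of 0.0.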

!!! tip deepseek-chat TL;DR

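This paper proposes SHREC, an algorithm that reconstructs the three-dimensional structure of helical molecules from cryo-EM images without prior knowledge of the helical symmetry parameters, addressing the unreliability of traditional methods that depend on initial parameter estimates.

Abstract (Chinese translation)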

冷冻电子显微镜（cryo-EM）已成为一种在近原子分辨率下解析生物分子三维结构的强大技术。然而，螺旋组装体的重构因其固有的对称性以及需要确定未知螺旋对称参数而面临独特挑战。传统方法需要对这些参数进行精确的初始估计，这通常通过反复试验或先验知识获得。这些要求可能导致错误的重构，从而限制了从头螺旋重构的可靠性。

本研究提出了一种名为SHREC（Spectral Helical REConstruction）的算法，该算法可直接从螺旋片段的二维冷冻电镜图像中恢复其投影角度，而无需螺旋对称参数的先验知识。我们的方法利用了螺旋片段的投影形成一维流形的洞见，该流形可通过谱嵌入技术进行恢复。在公开数据集上的实验验证表明，SHREC在准确恢复螺旋参数的同时实现了高分辨率重构，且仅需知晓样品的轴向对称群。通过消除对初始对称性估计的需求，SHREC为冷冻电镜中螺旋结构的测定提供了一条更稳健且自动化的途径。

Abstract

Cryo-electron microscopy (cryo-EM) has emerged as a powerful technique for determining the three-dimensional structures of biological molecules at near-atomic resolution. However, reconstructing helical assemblies presents unique challenges due to their inherent symmetry and the need to determine unknown helical symmetry parameters. Traditional approaches require an accurate initial estimation of these parameters, which is often obtained through trial and error or prior knowledge. These requirements can lead to incorrect reconstructions, limiting the reliability of ab initio helical reconstruction. In this work, we present SHREC (Spectral Helical REConstruction), an algorithm that directly recovers the projection angles of helical segments from their two-dimensional cryo-EM images, without requiring prior knowledge of helical symmetry parameters. Our approach leverages the insight that projections of helical segments form a one-dimensional manifold, which can be recovered using spectral embedding techniques. Experimental validation on publicly available datasets demonstrates that SHREC achieves high resolution reconstructions while accurately recovering helical parameters, requiring only knowledge of the specimen’s axial symmetry group. By eliminating the need for initial symmetry estimates, SHREC offers a more robust and automated pathway for determining helical structures in cryo-EM.

Keywords: cryo-electron microscopy, helical reconstruction, spectral embedding, ab initio, three-dimensional structure, biological molecules, SHREC, symmetry parameters

261. ❌ A Conceptual Shift In Our Understanding of Degenerate Radical Spin Systems: Spin-Rotation Coupling Turned On Its Head

Authors: Linqing Peng, Titouan Duston, Nadine Bradbury, Mansi Bhati, Xuecheng Tao, Michael Rosen, Joseph E. Subotnik Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

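Scoring rationale: This paper studies radical spin systems in chemical physics, focusing on spin-rotation coupling and an alternative to Born-Oppenheimer theory; it belongs to theoretical chemistry and molecular physics. All keywords concern large models, deep learning, AI techniques, and their applications, none of which the paper touches. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', but the paper uses no AI methods and relies purely on theoretical calculation and physical models, so it is given only 5 points (some relevance, since it is scientific computing, but not AI-driven). The other keywords have no connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper challenges the traditional Born-Oppenheimer description of radical spin systems. Using a recently proposed approach that treats potential energy surfaces as functions of both nuclear position and nuclear momentum, it predicts nondegenerate spin-dependent potential energy surfaces that agree quantitatively with experimentally observed spin-rotation couplings, pointing to a new picture of coupled spin-nuclear motion.

Abstract (Chinese translation)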

对于大多数化学家而言，克拉默斯简并指的是：对于任何自由基体系，所有核位置 $\mathbf{X}$ 下的每个势能面都至少是双重简并的（包含自旋向上与自旋向下、时间反演的态）。尽管如此，正如自旋化学领域的研究者所熟知，实验上可以观测到双态体系中几乎每个转动能级的分裂——这揭示了核运动会打破此类玻恩-奥本海默（Born-Oppenheimer, BO）电子态的自旋简并。因此，就预测实验光谱而言，BO简并的意义非常有限，除非进一步以严格方式完整处理核-电子纠缠；事实上，在玻恩-奥本海默势能面的范式下理解自由基分子（及其定态的简并性）可能极违反直觉。近年来，作为BO理论的一种替代方案，新理论提出将自由基势能面描述为核位置 $\mathbf{X}$ 与核动量 $\mathbf{P}$ 的函数，该方法已被证明能够复现BO理论之外的大量可观测量，例如振动圆二色性、拉曼光学活性和λ型分裂。本文中，我们证明该技术预测不同自旋态将遵循不同的（非简并）势能面，并且这些自旋依赖势能面之间的差异在定量上与实验观测的自旋-转动耦合一致——且完全不与克拉默斯简并原理相矛盾。因此，本研究提示关于自旋分辨的分子反应性仍有许多有待探索之处，这要求我们在理解耦合的自旋-核运动时进行概念上的转变，尤其是在已知会出现自旋分离的手性分子与材料体系中。

Abstract

For most chemists, Kramers’ degeneracy refers to the fact that for any radical system, every potential energy surface is at least doubly degenerate (with spin up and spin down, time-reversed solutions) for all nuclear positions $\mathbf{X}$. That being said, as is well-known to the community of spin chemists, one can experimentally detect a splitting of almost every rotational energy level for a doublet system – highlighting the fact that nuclear motion breaks the spin degeneracy of such BO electronic states. Thus, as far as predicting experimental spectra, the implications of BO degeneracy are very limited unless one further includes a complete treatment of nuclear-electronic entanglement in a robust fashion; indeed, understanding radical molecules (and the degeneracy of their stationary states) can be extremely non-intuitive within the paradigm of Born-Oppenheimer potential energy surfaces. Now, as an alternative to BO theory, recent theory has suggested characterizing radical potential energy surfaces as functions of both nuclear position $\mathbf{X}$ and nuclear momentum $\mathbf{P}$, an approach which has been shown to recover a host of observables outside of BO theory, e.g., vibrational circular dichroism, Raman optical activity, and lambda doubling. Here, we show that such a technique predicts that different spin states will follow different (nondegenerate) potential energy surfaces and that the differences in these spin-dependent surfaces is quantitatively consistent with experimental spin-rotation couplings – all without any contradiction with regard to Kramers’ degeneracy. Thus, the present finding suggests there is still a great deal to learn about spin-resolved molecular reactivity, demanding a conceptual shift in our understanding of coupled spin-nuclear motion, especially in the context of chiral molecules and materials where spin-separation is known to arise.

Keywords: radical spin systems, spin-rotation coupling, Born-Oppenheimer theory, potential energy surfaces, Kramers degeneracy, nuclear-electronic entanglement, chiral molecules, spin-resolved reactivity

262. ❌ Resource-efficient Quantum Algorithms for Selected Hamiltonian Subspace Diagonalization

Authors: Vincent Graves, Manqoba Q. Hlatshwayo, Theodoros Kapourniotis, Konstantinos Georgopoulos Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13160v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

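Scoring rationale: This paper focuses on quantum computing algorithms (in particular quantum selected configuration interaction, QSCI, and quantum selected heat-bath CI) applied to quantum chemistry, for subspace diagonalization of molecular Hamiltonians. Its content is unrelated to deep learning and large models: every keyword except the last concerns the principles, training methods, inference optimization, alignment, or application paradigms of deep learning and large models, and has no connection to this quantum-algorithm work. The only possible link is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies quantum algorithms to molecular simulation, which falls within a broad notion of scientific computing/AI for Science; however, the paper uses no AI or machine-learning methods, only quantum algorithms, so it is given 5 points (some relevance, but not central).

!!! tip deepseek-chat TL;DR

This paper proposes two resource-efficient quantum algorithms (CIM-QSCI and QSHCI) for subspace diagonalization of molecular Hamiltonians. They substantially reduce quantum resource requirements while maintaining accuracy, and their performance is evaluated through simulations of the N₂ and naphthalene molecules.

Abstract (Chinese translation)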

在NISQ时代，用于选择哈密顿量子空间以进行对角化的量子算法已成为变分算法的一种有前景的替代方案。迄今为止，此类算法——包括量子选择组态相互作用（QSCI）和基于采样的量子对角化（SQD）——均在福克空间的第二量子化框架中表述，这导致了量子比特资源的低效使用。我们首次提出了在CI矩阵（CIM）框架下开发的QSCI算法，该框架已知具有最优的量子比特标度，即精确为$\lceil \log_2 (N_{CSF}) \rceil$，其中$N$为CIM的尺寸。此外，我们引入了一种新颖的单比特翻转误差缓解技术，其开销仅需一个额外量子比特，并将此技术与从qDRIFT方法改进而来的随机近似特罗特演化相结合。通过在量子硬件上模拟基准分子N$_2$和萘，我们的结果取得了与SQD方法相当的精度，但消耗的量子资源显著减少。然而，对于相同任务，我们的CIM-QSCI算法和SQD方法均未能达到经典热浴CI（HCI）的性能水平。因此，我们提出了一种增强版的QSCI，称为量子选择热浴CI（QSHCI）。该变体方法利用QSCI的量子采样替代经典热浴采样，从而实现了与HCI相当的性能。我们注意到，当前方法的一个缺点在于构建CIM并进行泡利分解所需的预处理成本为$\mathcal{O}(N^2\log N)$。通过为随机特罗特演化设计高效的CIM访问模型，这一成本有望进一步降低。

Abstract

Quantum algorithms for selecting a subspace of Hamiltonians to diagonalize have emerged as a promising alternative to variational algorithms in the NISQ era. So far, such algorithms, which include the quantum selected configuration interaction (QSCI) and sample-based quantum diagonalization (SQD), have been formulated in second-quantization in Fock space, which leads to inefficient usage of qubit resources. We introduce the first QSCI algorithm developed in the CI-matrix (CIM) framework, which is known to have optimal qubit scaling of exactly $\lceil \log_2 (N_{CSF}) \rceil$ where $N$ is the size of the CIM. In addition, we introduce a novel single-bit flip error mitigation which comes at the overhead of a single qubit and we combine this with a stochastic approximate Trotterization evolution adapted from qDRIFT. Simulating benchmark N$_2$ and naphthalene molecules on quantum hardware, our results achieved similar accuracy as SQD methods but with significantly less quantum resources. However, our CIM-QSCI algorithm and SQD methods could not match the performance of classical heat-bath CI (HCI) for the same task. Hence, we introduce an augmented version of QSCI called quantum selected heat-bath CI (QSHCI). This variant replaces classical heat-bath sampling with quantum sampling from QSCI to achieve performance comparable to HCI. We note that a current drawback of our approach is the preprocessing cost of $\mathcal{O}(N^2\log N)$ for constructing the CIM and performing the Pauli decomposition. This can be further improved by considering efficient CIM access models for the stochastic Trotter evolution.
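
To make the qubit-scaling claim in the abstract concrete, here is a minimal sketch comparing $\lceil \log_2 N \rceil$ qubits for a CI matrix of dimension $N$ with the one-qubit-per-spin-orbital count of a Jordan-Wigner encoding in second quantization. The function names and the example active space are hypothetical and are not taken from the paper. The determinant count is used as a stand-in for $N_{CSF}$; the number of CSFs is no larger, so the CIM figure shown is an upper bound.

```python
from math import comb

def cim_qubits(dim: int) -> int:
    """Qubits needed to index `dim` basis states, i.e. ceil(log2(dim))."""
    if dim < 1:
        raise ValueError("dimension must be at least 1")
    # For dim >= 1, (dim - 1).bit_length() equals ceil(log2(dim)) exactly,
    # avoiding floating-point rounding in math.log2.
    return (dim - 1).bit_length()

def fock_qubits(n_spatial_orbitals: int) -> int:
    """Jordan-Wigner encoding: one qubit per spin orbital."""
    return 2 * n_spatial_orbitals

def n_determinants(n_spatial_orbitals: int, n_alpha: int, n_beta: int) -> int:
    """Number of Slater determinants with fixed alpha and beta electron counts."""
    return comb(n_spatial_orbitals, n_alpha) * comb(n_spatial_orbitals, n_beta)

# Hypothetical active space: 10 spatial orbitals, 5 alpha and 5 beta electrons.
n_orb, n_alpha, n_beta = 10, 5, 5
dim = n_determinants(n_orb, n_alpha, n_beta)

print(f"basis states: {dim}")
print(f"CIM-style qubits (upper bound): {cim_qubits(dim)}")
print(f"Fock-space qubits: {fock_qubits(n_orb)}")
```

For this example the basis has 63,504 determinants, so 16 qubits suffice to index it, compared with 20 qubits in the spin-orbital encoding.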

Keywords: Quantum algorithms, Hamiltonian subspace diagonalization, Quantum selected configuration interaction (QSCI), CI-matrix framework, Quantum selected heat-bath CI (QSHCI), Resource-efficient, Error mitigation, Molecular simulation

263. ❌ Is the matrix completion of reduced density matrices unique?

Authors: Gustavo E. Massaccesi, Ofelia B. Oña, Luis Lain, Alicia Torre, Juan E. Peralta, Diego R. Alcoba, Gustavo E. Scuseria Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.13087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

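Scoring rationale: This paper studies the matrix completion problem for reduced density matrices (RDMs) in quantum many-body systems, in the area of computational chemistry/quantum physics. Its content is unrelated to the vast majority of keywords (which concern large models, deep learning, AI principles, and so on), so those score 0. It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it involves scientific computing (electronic structure theory) and an algorithm (a hybrid quantum-stochastic algorithm), which could be seen as a potential area for AI in science; however, the paper does not explicitly use AI or deep learning, so it is given 5 points (moderate relevance).

!!! tip deepseek-chat TL;DR

This paper examines whether the matrix completion problem of reconstructing the two-particle reduced density matrix (2-RDM) from partial data in quantum many-body systems has a unique solution. Building on Rosina's theorem, it shows that the completion is unique under certain conditions and proposes a hybrid quantum-stochastic algorithm that achieves exact matrix completion, demonstrated on the Fermi-Hubbard model.

Abstract (Chinese translation)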

约化密度矩阵是多体量子系统中描述可观测量问题的核心。在电子结构理论中，二粒子约化密度矩阵（2-RDM）足以确定能量及其他关键性质。近期研究利用矩阵补全方法，结合约化密度矩阵的低秩结构和近似理论模型，从部分数据中重构二粒子约化密度矩阵，从而降低计算成本。然而，矩阵补全通常是一个欠定问题。通过重新审视罗西纳定理[M. Rosina, Queen’s Papers on Pure and Applied Mathematics No. 11, 369 (1968)]，本文证明在特定条件下矩阵补全具有唯一性，并确定了能够从不完整信息中精确重构二粒子约化密度矩阵的元素子集。基于此，我们提出一种混合量子-随机算法，可实现精确的矩阵补全，并通过费米-哈伯德模型的应用验证了该方法的有效性。

Abstract

Reduced density matrices are central to describing observables in many-body quantum systems. In electronic structure theory, the two-particle reduced density matrix (2-RDM) suffices to determine the energy and other key properties. Recent work has used matrix completion, leveraging the low-rank structure of RDMs and approximate theoretical models, to reconstruct the 2-RDM from partial data and thus reduce computational cost. However, matrix completion is, in general, an under-determined problem. Revisiting Rosina’s theorem [M. Rosina, Queen’s Papers on Pure and Applied Mathematics No. 11, 369 (1968)], we here show that the matrix completion is unique under certain conditions, identifying the subset of 2-RDM elements that enables its exact reconstruction from incomplete information. Building on this, we introduce a hybrid quantum-stochastic algorithm that achieves exact matrix completion, demonstrated through applications to the Fermi-Hubbard model.

Keywords: reduced density matrix, matrix completion, quantum many-body systems, electronic structure theory, 2-RDM, Rosina's theorem, hybrid quantum-stochastic algorithm, Fermi-Hubbard model

264. ❌ Electron confinement within a fluctuation “box” in liquid water

Authors: Korenobu Matsuzaki, Hikaru Kuramochi, Tahei Tahara Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

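Scoring rationale: This paper reports an experimental study of hydrated electrons in liquid water, in the area of physical chemistry and spectroscopy. It is unrelated to all scoring keywords (which concern large models, deep learning techniques, and their applications), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Using transient two-dimensional electronic spectroscopy, this paper studies hydrated electrons in liquid water and finds that their shape and size are highly nonuniform and fluctuate significantly on timescales shorter than 30 fs.

Abstract (Chinese translation)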

电子在小体积内的受限作为“势箱中的粒子”这一量子力学教科书经典模型的现实呈现，具有重要研究价值。尽管电子受限在固态体系中较易设想，但其同样存在于液体中——液体的局部空腔可充当限制电子的“箱子”。液体中这些柔性空腔对电子的束缚机制预期与固体存在本质差异。本文通过瞬态二维电子光谱实验，对液态水中受限的电子（即水合电子）进行了研究。实验揭示了水合电子的形状与尺寸具有高度非均匀性，并在短于30飞秒的时间尺度上表现出显著涨落。

Abstract

Electron confinement within a small volume is intriguing as a realization of the particle-in-a-box system, which appears in every quantum mechanics textbook. While the electron confinement is readily imaginable in solid-state systems, it also occurs in liquids, where the local voids in the liquid serve as confining “boxes.” Confinement within these flexible cavities in liquids is expected to differ fundamentally from that in solids. Here, we experimentally investigate the electrons confined in liquid water, which are called hydrated electrons, using transient two-dimensional electronic spectroscopy. Our experiment reveals the large nonuniformity of the shape and the size of hydrated electrons with significant fluctuation at the timescale shorter than 30 fs.

Keywords: electron confinement, hydrated electrons, liquid water, transient two-dimensional electronic spectroscopy, fluctuation, particle-in-a-box, nonuniformity, quantum mechanics

265. ❌ Reaction-Level Consistency within the Variational Quantum Eigensolver: Homodesmotic Ring Strain Energies of Cyclic Hydrocarbons

Authors: L. Roy, M. Sarkar, M. Tewari, A. Kumar, M. Paranjothy Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

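Scoring rationale: This paper focuses on quantum computational chemistry, using the Variational Quantum Eigensolver (VQE) to compute ring strain energies of cyclic hydrocarbons; it belongs to computational chemistry and quantum simulation. All keywords relate to large models, deep learning, and AI principles or applications, none of which the paper touches. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is computational chemistry (related to cheminformatics); however, it uses no AI or large-model methods, only quantum computing and traditional quantum chemistry, so it is given 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

This study addresses the inconsistent treatment of electron correlation when computing reaction energies with the Variational Quantum Eigensolver (VQE). Combining a symmetry-guided active space selection protocol with homodesmotic reaction schemes, it computes ring strain energies for a series of cyclic hydrocarbons from cyclopropane to adamantane. The results reach chemical accuracy relative to density functional theory (DFT) and agree closely with coupled cluster singles and doubles (CCSD) benchmarks.

Abstract (Chinese translation)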

在量子计算平台上利用变分量子本征求解器（VQE）等量子-经典混合算法模拟化学反应时，反应能量评估中电子关联的一致处理构成了挑战。本研究采用先前报道的一种对称性引导的活性空间选择方案，通过同键反应方案计算环状碳氢化合物的环应变能。该方案通过选择能产生相同对称性匹配分数（SMF）值的活性空间，强制所有反应物与产物间的对称性一致，从而在反应层面确保关联处理的平衡性。当给定分子存在多个满足此标准的活性空间时，较大的活性空间通常能提供更优的关联处理；然而，在同键反应框架内，由于有利的误差抵消效应，较小的对称性一致活性空间也能产生可比拟的一致性结果。利用此框架，我们对从环丙烷到结构复杂的金刚烷等一系列饱和及不饱和环状碳氢化合物的环应变能进行了评估。所得能量相对于密度泛函理论（DFT）达到了化学精度，并与耦合簇单双激发（CCSD）基准值高度吻合。在分子复杂度递增过程中表现出的系统性性能，突显了将同键反应设计与对称性一致的VQE计算相结合的有效性。该方法通过强制反应物种间遵循物理基础的一致性，明确展示了将基于反应的量子模拟拓展至更大分子体系和更广泛化学反应类别的潜力。

Abstract

Simulation of chemical reactions on quantum computing platforms using quantum classical hybrid algorithms such as the Variational Quantum Eigensolver (VQE) is challenged by the need for a reaction consistent treatment of electron correlation in reaction energy evaluations. In this work, we employ a previously reported symmetry guided active space selection protocol to compute ring strain energies of cyclic hydrocarbons using homodesmotic reaction schemes. The protocol enforces symmetry consistency across all reactants and products by selecting active spaces that yield identical symmetry matched fraction (SMF) values, thereby ensuring balanced correlation treatment at the reaction level. When multiple active spaces satisfy this criterion for a given molecule, larger active spaces often provide improved correlation treatment; however, smaller symmetry consistent active spaces can also yield comparable agreement due to favorable error cancellation within the homodesmotic framework. Using this framework, ring strain energies were evaluated for a series of saturated and unsaturated cyclic hydrocarbons, ranging from cyclopropane to the structurally complex adamantane. The resulting energies achieve chemical accuracy relative to density functional theory (DFT) and remain in close agreement with coupled cluster singles and doubles (CCSD) benchmarks. The systematic performance across increasing molecular complexity highlights the effectiveness of combining homodesmotic reaction design with symmetry-consistent VQE calculations. This approach, which enforces physically grounded consistency across reaction species, demonstrates clear potential for extending reaction based quantum simulations to larger molecular systems and broader classes of chemical reactions.

Keywords: Variational Quantum Eigensolver, VQE, ring strain energies, cyclic hydrocarbons, homodesmotic reaction, symmetry consistent active space, quantum simulation, chemical accuracy

266. ❌ Polymer-Residue Accessibility Shapes Sequence Dependence of Critical Temperatures for Phase Separation

Authors: J. Pedro de Souza, Benjamin Sorkin, Amala Akkiraju, Athanassios Z. Panagiotopoulos, Howard A. Stone Journal/source: arXiv Published: 2026-03-13 arXiv link: http://arxiv.org/abs/2603.12534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

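Scoring rationale: This paper studies the physical mechanism behind the sequence dependence of critical temperatures for phase separation of biopolymers (such as intrinsically disordered proteins), in the area of computational biophysics. Its content is unrelated to the vast majority of keywords (large models, deep learning principles, training methods, inference optimization, alignment, agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves bioinformatics-related computational simulation (Monte-Carlo simulations) and theoretical analysis, a scientific computing application; however, it does not directly use AI or large-model techniques, so it is given 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper studies the sequence dependence of critical temperatures for biopolymer phase separation. It proposes an analytical perturbative theory based on a residue-accessibility parameter (RAP), which rationalizes the variations in critical temperature observed in extensive Monte-Carlo simulations of two-letter polymer solutions of varying length and sequence.

Abstract (Chinese translation)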

生物聚合物，如固有无序蛋白质，在细胞生物学中发挥着核心作用，包括介导相分离和控制生物凝聚体的活性。生物聚合物的物理性质和功能由其残基序列决定。近年来，大量的计算和理论工作致力于刻画生物聚合物相图那组合复杂的序列依赖性。本文中，我们定量地表明单体可及性是决定配对相互作用强度的关键。我们构建了一种解析微扰方法，从现象学上排除了两个聚合物质心在相关空穴内重叠的可能性。该理论通过一个残基可及性参数（Residue-Accessibility Parameter, RAP）对平均场相互作用强度进行了修正，该参数解释了内部单体参与相互作用的有限性。尽管方法简单，RAP合理解释了在针对数千种不同长度和序列的双字母聚合物溶液进行的广泛蒙特卡洛模拟中发现的临界温度变化。因此，对于任意聚合物长度、单体类型集合和聚合物混合物，RAP都可能有效地用于解读相图对聚合物序列的依赖性。

Abstract

Biological polymers, such as intrinsically disordered proteins, play a central role in cellular biology, including mediating phase separation and controlling activity of biological condensates. The physical properties and functions of biopolymers are determined by their residue sequence. Recently, significant computational and theoretical efforts have been devoted to characterizing the combinatorially complex sequence dependence of biopolymer phase diagrams. Here, we quantitatively show that monomer accessibility is central to determining the strength of pair interactions. We formulate an analytical perturbative approach, phenomenologically precluding two polymers’ centers of mass from overlapping within a correlation hole. This theory yields the correction to the strength of mean-field interactions in terms of a residue-accessibility parameter (RAP), which accounts for the limited availability of inner monomers to interactions. Despite the simplicity of the approach, RAP rationalizes the variations in critical temperatures found in extensive Monte-Carlo simulations for thousands of two-letter polymer solutions of varying length and sequence. RAP may thus be effective for deciphering the polymer-sequence dependence of phase diagrams given any polymer length, set of monomer types, and polymer mixtures.

Keywords: polymer phase separation, critical temperature, sequence dependence, residue accessibility parameter (RAP), Monte-Carlo simulations, intrinsically disordered proteins, biopolymers, analytical perturbative approach

267. ❌ Nuclear-Electronic Quantum Dynamics in a Plasmonic Nanocavity

Authors: Jonathan H. Fetherolf, Tao E. Li, Sharon Hammes-Schiffer Journal/source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

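Scoring rationale: The paper studies nuclear-electronic quantum dynamics in a plasmonic nanocavity, using real-time nuclear-electronic orbital time-dependent density functional theory (RT-NEO-TDDFT) to simulate how chemical systems behave in electromagnetic environments. Its content is confined to computational chemistry, quantum dynamics, and nanophotonics, and it involves no large language models, deep learning techniques, or AI methods. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper belongs to computational chemistry and can be viewed as a scientific-computing application, but it does not itself use AI methods, so it receives a relevance of 5 (some relevance). All other keywords are entirely unrelated to the paper and receive 0.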
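To make the verdict above concrete, the following Python sketch recomputes the pass threshold from the report's configured formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and compares it with this paper's total. The report does not state how a per-keyword score is derived from weight and relevance. The sketch assumes a plain product, which matches the rows in the table but does not model the configured per-keyword maximum of 15. The names `KeywordScore`, `pass_threshold`, `paper_score`, and `passes` are hypothetical and are not taken from the report generator.

```python
from dataclasses import dataclass

# Constants taken from the report's configured pass-score formula.
BASE_PASS_SCORE = 5.0
PASS_WEIGHT_FACTOR = 0.8


@dataclass
class KeywordScore:
    keyword: str
    weight: float     # "Weight" column of the score table
    relevance: float  # "Relevance" column, on a 0-10 scale


def pass_threshold(total_weight: float) -> float:
    """Pass score = 5.0 + 0.8 * total weight, as configured in the report."""
    return BASE_PASS_SCORE + PASS_WEIGHT_FACTOR * total_weight


def paper_score(rows: list[KeywordScore]) -> float:
    """Sum of per-keyword scores.

    ASSUMPTION: per-keyword score = weight * relevance. The report does not
    document this formula; it is only consistent with the rows shown.
    """
    return sum(row.weight * row.relevance for row in rows)


def passes(rows: list[KeywordScore], total_weight: float) -> bool:
    """A paper passes when its total score reaches the threshold."""
    return paper_score(rows) >= pass_threshold(total_weight)


# Two rows from the table for paper 267; the remaining 25 rows are all zeros.
rows = [
    KeywordScore("Large Language Models OR LLMs OR Foundation Models", 0.0, 0.0),
    KeywordScore("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 5.0),
]

print(round(pass_threshold(27.0), 1))  # 26.6
print(paper_score(rows))               # 0.0
print(passes(rows, 27.0))              # False
```

Under this assumption, a weight of 0.0 cancels the 5.0 relevance on the last row, so the paper totals 0.0 and stays below the 26.6 threshold.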

!!! tip deepseek-chat TL;DR

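The paper uses real-time nuclear-electronic orbital time-dependent density functional theory (RT-NEO-TDDFT) to study the nuclear-electronic quantum dynamics of chemical systems in plasmonic nanocavities. It finds that multimode cavities can probe and modify proton transfer reactions, and that under strong coupling, polariton formation can suppress the reaction or produce Rabi-like oscillations in the cavity emission.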

Abstract

Plasmonic nanocavities are a promising platform for strong light-matter coupling and enhanced spectroscopies at the single-molecule level. These nanoscale environments are challenging to model due to their strongly multimodal character and short cavity lifetimes. Herein, we study the effects of these environments using real-time nuclear-electronic orbital time-dependent density functional theory (RT-NEO-TDDFT) coupled to multiple classical cavity modes in a manner that includes cavity loss. In RT-NEO-TDDFT, the quantum mechanical densities of all electrons and specified nuclei, typically protons, are propagated in real time. We show that a cavity with many modes at different frequencies can be used to probe and modify the nuclear-electronic quantum dynamics of chemical systems. Ultrafast excited-state proton transfer reactions can be probed through the time- and energy-resolved cavity emission of a multimode cavity. Under strong coupling conditions, the cavity can modify the dynamics, in some cases suppressing proton transfer and exhibiting Rabi-like oscillations of the cavity emission due to polariton formation. Utilizing the spectral density for an experimentally relevant nanoparticle-on-mirror single-molecule cavity, we show that an excited-state proton transfer system can evolve into resonance with the cavity even when initially out of resonance with the dominant cavity peak. In this case, tuning the dominant cavity peak to be resonant with the electronic transition leads to polariton formation for a small collection of molecules. The RT-NEO framework with multimode cavities enables the efficient simulation of chemical reactions in physically realistic electromagnetic environments, providing fundamental insights into the dynamics and associated spectroscopic signatures.

Keywords: Plasmonic nanocavity, Nuclear-electronic quantum dynamics, RT-NEO-TDDFT, Proton transfer, Polariton formation, Multimode cavity, Strong light-matter coupling, Real-time simulation

Token usage statistics

Total: 843,086 tokens (input 569,981 / output 273,105)

Model	Input	Output	Total
deepseek-chat	479,909	266,666	746,575
glm-4.7	90,072	6,439	96,511
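As a quick consistency check on the figures above, the short Python sketch below sums the per-model input and output counts and compares them with the reported totals. The dictionary layout and the function name `token_totals` are illustrative assumptions. Only the numbers come from the table.

```python
# Per-model token counts copied from the table above.
usage = {
    "deepseek-chat": {"input": 479_909, "output": 266_666},
    "glm-4.7": {"input": 90_072, "output": 6_439},
}


def token_totals(usage: dict[str, dict[str, int]]) -> tuple[int, int, int]:
    """Return (total input, total output, grand total) across all models."""
    total_in = sum(counts["input"] for counts in usage.values())
    total_out = sum(counts["output"] for counts in usage.values())
    return total_in, total_out, total_in + total_out


total_in, total_out, grand_total = token_totals(usage)
print(f"input={total_in:,} output={total_out:,} total={grand_total:,}")
```

The sums agree with the reported totals: 569,981 input, 273,105 output, and 843,086 overall.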
