📊 ArXiv Research Report (2026-04-01)

Generated: 2026-04-01 09:28:02
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing-score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
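
The threshold arithmetic above can be checked with a short sketch. The function names are hypothetical, and the rule that a keyword's score equals weight × relevance (capped at the per-keyword maximum) is an assumption inferred from the score tables in this report, not a documented part of the scoring tool. The sample rows are the non-zero entries reported for the first passing paper.

```python
def passing_score(weights, base=5.0, factor=0.8):
    """Passing threshold: base + factor * total keyword weight."""
    return base + factor * sum(weights)


def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores, capping each one at max_per_keyword.

    rows: iterable of (weight, relevance) pairs. Assumption: a keyword's
    score is weight * relevance, which is consistent with the report's tables.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


weights = [1.0] * 27                  # 27 keywords, each with weight 1.0
threshold = passing_score(weights)    # 5.0 + 0.8 * 27.0
hisa_rows = [(1.0, 10.0), (1.0, 15.0), (1.0, 10.0), (1.0, 10.0), (1.0, 10.0)]
total = paper_score(hisa_rows)
print(round(threshold, 1), total, total >= threshold)
```

With all 27 weights at 1.0, the threshold works out to 5.0 + 21.6 = 26.6, which matches the configured passing score.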

📈 Paper statistics

Total fetched: 326 papers
Passing papers: 8 (2.5%)

⭐ Detailed analysis of passing papers

1. HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di yin, Xing Sun, Muhan Zhang
Journal/Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28458v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

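Scoring rationale: HISA focuses on improving sparse attention in large language models (LLMs), specifically by optimizing the indexer of DeepSeek Sparse Attention (DSA). Its core contribution is a hierarchical indexing method that reduces the O(L²) indexer bottleneck and thereby accelerates inference at long context lengths (such as 32K and 128K). The paper is therefore rated highly relevant to “Large Language Models”; to “Mixture of Experts” OR “MoE” OR “Sparse Models” (because DSA is treated as a sparse model); to “Context Window Extension” OR “Long Context LLMs” (it targets long-context optimization); to “KV Cache Compression” OR “Linear Attention” OR “FlashAttention” (it is an attention-mechanism optimization); and to “Speculative Decoding” OR “Inference Acceleration” (it directly speeds up inference). These keywords score 10 or 15. Other keywords, such as training methods, alignment, agents, and AI for science, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

HISA proposes a hierarchical indexing method that improves the indexer for fine-grained sparse attention in large language models. It addresses the O(L²) indexer bottleneck at long context lengths and achieves a 2× to 4× speedup while preserving selection accuracy.

Abstract translation (Chinese)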

以DeepSeek稀疏注意力（DSA）为代表的令牌级稀疏注意力机制，通过一个轻量级索引器为每个查询对每个历史令牌进行评分，并仅对选中的子集计算注意力，从而实现了细粒度的关键令牌选择。尽管下游的稀疏注意力能够高效扩展，但索引器仍需为每个查询扫描整个前缀序列，这引入了每层O($L^2$)的计算瓶颈，随着上下文长度的增长，该瓶颈将变得难以承受。我们提出了HISA（分层索引稀疏注意力），作为一种可直接替换的索引器方案，它将搜索过程从平坦的令牌扫描转变为两阶段分层处理。首先，块级粗粒度过滤器对池化的块代表进行评分，以剪除无关区域。随后，令牌级细粒度处理仅在剩余的候选块内部应用原始索引器。HISA保留了下游Sparse MLA算子所需的精确令牌级Top-K稀疏模式，且无需额外训练。在内核级基准测试中，HISA在32K上下文长度下实现了2$\times$的加速，在128K长度下实现了4$\times$的加速。在Needle-in-a-Haystack和LongBench评测中，我们直接将DeepSeek-V3.2中的索引器替换为HISA，未进行任何微调。HISA在质量上与原始DSA高度吻合，同时显著优于块稀疏基线方法。此外，HISA与原始DSA产生的令牌选择集合的平均交并比大于99%，表明其效率提升几乎未对选择保真度产生影响。

Abstract

Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O($L^2$) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.
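
The two-stage search described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' kernel: the dot-product scoring, the mean pooling used for block representatives, and all function and parameter names (`flat_topk`, `hierarchical_topk`, `block_size`, `keep_blocks`) are assumptions chosen for illustration only.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flat_topk(query, keys, k):
    """Baseline indexer: score every token and keep the k best indices."""
    scores = [dot(query, key) for key in keys]
    return set(sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k])


def hierarchical_topk(query, keys, k, block_size, keep_blocks):
    """Two-stage search: prune blocks by pooled score, then refine per token."""
    num_blocks = (len(keys) + block_size - 1) // block_size
    block_scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: mean pooling builds the block representative.
        rep = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append(dot(query, rep))
    kept = sorted(range(num_blocks), key=lambda b: block_scores[b], reverse=True)[:keep_blocks]
    candidates = [
        i
        for b in kept
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    # Token-level refinement reuses the flat indexer's scoring on survivors only.
    return set(sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)[:k])


def iou(a, b):
    """Intersection over union of two token-index sets."""
    return len(a & b) / len(a | b)


rng = random.Random(0)
dim, length = 8, 256
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
query = [rng.gauss(0, 1) for _ in range(dim)]
flat = flat_topk(query, keys, k=16)
hier = hierarchical_topk(query, keys, k=16, block_size=16, keep_blocks=8)
print(f"IoU between flat and hierarchical selections: {iou(flat, hier):.2f}")
```

In this sketch the fine stage scores only `keep_blocks * block_size` tokens instead of the whole prefix, which is where the savings would come from. The random keys used here have no locality, so the printed IoU says nothing about the overlap reported in the paper; keeping all blocks reproduces the flat selection exactly.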

Keywords: Sparse Attention, Hierarchical Indexing, Long Context, Inference Acceleration, DeepSeek, Token-level Selection, Attention Mechanism, Computational Efficiency

2. Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Sy

Authors: Iman Sharifi, Alex Zongo, Peng Wei
Journal/Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28561v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

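Scoring rationale: The paper studies the application of LLMs to tactical deconfliction for unmanned aerial systems. It centers on LLMs as decision-makers, supervised fine-tuning (SFT), parameter-efficient fine-tuning with LoRA, LLM agents, and multi-agent systems. These keywords are highly relevant to the paper (10 points each); other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how supervised fine-tuning and preference-based fine-tuning (using LoRA and GRPO) can adapt a large language model (Qwen-Math-7B) to improve decision accuracy, consistency, and separation performance in cooperative tactical deconfliction within dense, partially observable, heterogeneous multi-agent environments such as small unmanned aerial systems.

Abstract translation (Chinese)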

随着小型无人机系统在低空空域的日益广泛部署，在安全关键约束下实现可靠的战术冲突解脱需求日益增长。战术冲突解脱涉及在密集、部分可观测且异构的多智能体环境中进行短时域决策，必须同时保持协同间隔保障与运行效率。尽管大语言模型展现出强大的推理能力，但其直接应用于空中交通管制仍受限于领域知识基础不足和输出不可预测的不一致性。本文研究将大语言模型作为协同多智能体战术冲突解脱的决策者，采用微调策略使模型输出与人类操作员的启发式规则对齐。我们提出了一种基于BlueSky空中交通模拟器的仿真-语言数据生成流程，该流程能产生符合既定安全实践、规则一致的冲突解脱数据集。采用两种参数高效策略对预训练的Qwen-Math-7B模型进行微调：基于低秩适配的监督微调，以及结合低秩适配与组相对策略优化的偏好微调。在验证数据集和闭环仿真上的实验结果表明，与预训练大语言模型相比，监督式低秩适配微调显著提升了决策准确性、一致性和间隔保持性能，并大幅减少了近距空中碰撞事件。组相对策略优化提供了额外的协同优势，但在与异构智能体策略交互时表现出鲁棒性下降。

Abstract

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs to human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.
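
The two fine-tuning ingredients named in the abstract can be sketched in a few lines of plain Python. This is an illustration of the general techniques only, not the paper's training code: the helper names and toy values are hypothetical, and the use of the population standard deviation in the group normalization is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Frozen weight W plus a scaled low-rank update: W x + (alpha / r) * B (A x)."""
    rank = len(A)  # A is (rank x d_in), B is (d_out x rank)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical toy values, not taken from the paper.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]      # rank 1
B = [[0.0], [0.0]]    # zero-initialized B leaves the base model unchanged
print(lora_forward([2.0, 3.0], W, A, B, alpha=2.0))
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Only `A` and `B` would be trained in the LoRA setting, while `W` stays frozen. The group-relative advantages give positive signal to responses that beat their group's average reward and negative signal to those below it.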

Keywords: Large Language Models, Supervised Fine-tuning, LoRA, LLM Agents, Multi-agent Systems, Tactical Deconfliction, Small Unmanned Aerial Systems, Parameter-efficient Fine-tuning

3. Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners

Authors: Rohan Pandey, Eric Ye, Michael Li
Journal/Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28038v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

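Scoring rationale: The paper analyzes the behavior of LLMs on scientific reasoning tasks, using GEPA to optimize prompts and study reasoning patterns, so it is highly relevant to “Large Language Models” (10 points). Its analysis of reasoning processes relates to “Chain of Thought” and “System 2 Thinking” (8 points each). Its emphasis on explaining the internal heuristics of LLMs makes it highly relevant to “Mechanistic Interpretability” (10 points). Its application to scientific reasoning tasks relates to “AI for Science” (8 points). Other keywords, such as MoE, SLMs, training methods, optimization techniques, and agent systems, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study uses GEPA to optimize prompts for LLMs on scientific reasoning tasks. It finds that performance gains often depend on model-specific heuristics that generalize poorly (termed “local” logic) and argues that prompt optimization can serve as a tool for model interpretability.

Abstract translation (Chinese)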

随着大型语言模型（LLM）在复杂推理任务上展现出日益精密的性能，当前架构已成为前沿模型内部启发式机制的关键代理。描述其涌现的推理能力对于长期的可解释性与安全性至关重要。此外，理解提示如何调节这些过程亦不可或缺，因为自然语言很可能成为与通用人工智能（AGI）系统交互的主要界面。在本研究中，我们采用遗传帕累托（GEPA）的自定义变体，系统性地优化科学推理任务的提示，并分析提示如何影响推理行为。我们探究了GEPA优化提示中固有的结构模式与逻辑启发式，评估了它们的可迁移性与脆弱性。研究结果表明，科学推理能力的提升往往对应着模型特定的启发式策略，这些策略无法在不同系统间泛化，我们称之为“局部”逻辑。通过将提示优化构建为模型可解释性的工具，我们认为，绘制LLM偏好的推理结构图谱，是未来与超人类智能有效协作的重要前提。

Abstract

As Large Language Models (LLMs) achieve increasingly sophisticated performance on complex reasoning tasks, current architectures serve as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety. Furthermore, understanding how prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with AGI systems. In this work, we use a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can affect reasoning behavior. We investigate the structural patterns and logical heuristics inherent in GEPA-optimized prompts, and evaluate their transferability and brittleness. Our findings reveal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call “local” logic. By framing prompt optimization as a tool for model interpretability, we argue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.

Keywords: Large Language Models, Scientific Reasoning, Prompt Optimization, Model Interpretability, Genetic Pareto (GEPA), Reasoning Behavior, Local Logic, AGI Systems

4. EffiSkill: Agent Skill Based Automated Code Efficiency Optimization

Authors: Zimu Wang, Yuling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng Wan, Xiaodong Gu
Journal/Source: arXiv
Published: 2026-03-29
arXiv link: http://arxiv.org/abs/2603.27850v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

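Scoring rationale: The paper studies LLM-based agents for code-efficiency optimization, so it is highly relevant to “Large Language Models” and “LLM Agents” (10 points each). Its skill retrieval and plan composition are somewhat related to “Retrieval-Augmented Generation”, “Chain of Thought”, “Tool Use”, and “In-context Learning” (5 points each). Other keywords, such as MoE, quantization, and alignment, do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The EffiSkill framework models code optimization patterns as reusable skills, enabling LLM agents to optimize code efficiency without relying on runtime feedback and improving optimization success rates by 3.69 to 12.52 percentage points over the strongest baseline.

Abstract translation (Chinese)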

代码效率是软件质量的基本维度，然而如何利用大语言模型（LLM）优化程序仍具挑战。现有方法多尝试一次性重写、检索示例或基于提示的搜索，但未能显式提炼可复用的优化知识，这限制了其在具体实例之外的泛化能力。
本文提出EffiSkill，一个面向代码效率优化的框架，旨在为基于LLM的智能体构建可移植的优化工具箱。其核心思想是将反复出现的“慢速到快速”代码转换建模为可复用的智能体技能，这些技能既捕捉具体的转换机制，也涵盖更高层次的优化策略。EffiSkill采用两阶段设计：第一阶段从大规模慢速/快速程序对中挖掘操作符技能（Operator Skill）与元技能（Meta Skill），构建技能库；第二阶段将该技能库应用于未见程序，通过免执行的诊断、技能检索、计划组合与候选代码生成来完成优化，无需运行时反馈。
在EffiBench-X上的实验结果表明，EffiSkill实现了更高的优化成功率，在不同模型与编程语言设置下，相比最强基线提升了3.69至12.52个百分点。这些发现表明，机制层面的技能重用为免执行的代码优化提供了有益基础，且所构建的技能库可作为可复用的资源，服务于更广泛的智能体工作流。

Abstract

Code efficiency is a fundamental aspect of software quality, yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, retrieved exemplars, or prompt-based search, but they do not explicitly distill reusable optimization knowledge, which limits generalization beyond individual instances. In this paper, we present EffiSkill, a framework for code-efficiency optimization that builds a portable optimization toolbox for LLM-based agents. The key idea is to model recurring slow-to-fast transformations as reusable agent skills that capture both concrete transformation mechanisms and higher-level optimization strategies. EffiSkill adopts a two-stage design: Stage I mines Operator and Meta Skills from large-scale slow/fast program pairs to build a skill library; Stage II applies this library to unseen programs through execution-free diagnosis, skill retrieval, plan composition, and candidate generation, without runtime feedback. Results on EffiBench-X show that EffiSkill achieves higher optimization success rates, improving over the strongest baseline by 3.69 to 12.52 percentage points across model and language settings. These findings suggest that mechanism-level skill reuse provides a useful foundation for execution-free code optimization, and that the resulting skill library can serve as a reusable resource for broader agent workflows.

Keywords: code efficiency optimization, large language models, LLM-based agents, agent skills, skill library, execution-free optimization, EffiBench-X, optimization success rate

5. MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Authors: Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
Journal/Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28590v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

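Scoring rationale: The paper studies the monitorability of chain-of-thought (CoT) reasoning in LLMs, so it is highly relevant to “Large Language Models” and “Chain of Thought” (10 points each). CoT monitoring involves explaining model behavior, which is strongly related to “Mechanistic Interpretability” (8 points). CoT monitoring aims to ensure that the reasoning faithfully reflects decision-critical factors, which is somewhat related to “System 2 Thinking” and “Hallucination Mitigation” (5 points each). The paper does not address other keywords such as MoE, SLMs, training techniques, inference optimization, or agent systems; these score 0.

!!! tip deepseek-chat TL;DR

The paper introduces MonitorBench, a benchmark for evaluating chain-of-thought (CoT) monitorability in large language models. It finds that CoT monitorability is higher when producing the final output requires structural reasoning through the decision-critical factors, that closed-source models generally show lower monitorability, and that monitorability can drop by up to 30% under stress tests.

Abstract translation (Chinese)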

大语言模型（LLM）生成的思维链（CoT）并不总是对其最终输出具有因果决定性。当出现这种不匹配时，思维链便无法忠实反映驱动模型行为的关键决策因素，从而导致思维链可监控性降低的问题。然而，目前仍缺乏一个全面且完全开源的基准来研究思维链可监控性。为填补这一空白，我们提出了MonitorBench，一个用于评估大语言模型中思维链可监控性的系统性基准。MonitorBench提供：（1）一套包含1,514个测试实例的多样化数据集，这些实例涵盖7大类别的19项任务，并精心设计了关键决策因素，用以刻画思维链在何种情况下可用于监控驱动大语言模型行为的因素；（2）两种压力测试设置，用于量化思维链可监控性可能下降的程度。在多个不同能力水平的流行大语言模型上进行的大量实验表明，当生成最终目标回答需要通过对关键决策因素进行结构化推理时，思维链可监控性更高。闭源大语言模型通常表现出较低的可监控性，且可监控性与模型能力之间存在负相关关系。此外，开源和闭源大语言模型在压力测试下均可有意降低可监控性，在一些无需对关键决策因素进行结构化推理的任务中，可监控性下降幅度高达30%。除这些实证发现外，MonitorBench为进一步研究评估未来大语言模型、探索高级压力测试可监控性技术以及开发新型监控方法提供了基础。

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model’s behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.

Keywords: Large Language Models, Chain of Thought, Monitorability, Benchmark, Interpretability, Reasoning, Faithfulness, Stress-test

6. Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning

Authors: Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf
Journal/Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27226v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

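Scoring rationale: The paper studies post-training methods for LLMs on deductive reasoning tasks (curriculum learning in particular), directly matching the “Large Language Models” and “Post-training/SFT” keywords (10 points each). Its focus on compositional reasoning and reasoning complexity relates to “Chain of Thought” and “System 2 Thinking” (8 points each). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study systematically evaluates curriculum learning in LLM post-training for deductive reasoning tasks. It finds that difficulty-based sequencing of training examples offers no robust advantage over random sampling in accuracy or response length, challenging the practical utility of curriculum learning for such tasks.

Abstract translation (Chinese)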

课程学习（Curriculum Learning, CL）基于“按难度递增顺序学习应有助于提升泛化能力”的直觉，在大语言模型（LLMs）的预训练与后训练阶段均被广泛采用。对于组合推理任务——即复杂问题由基础推理规则构建而成——CL的直觉尤其具有吸引力；然而，CL对此类任务的实际影响在很大程度上仍未得到充分探索。本文通过合成算术与逻辑基准测试（其中难度以推理复杂度而非表层特征为衡量标准），对LLMs后训练中的CL进行了系统的实证研究。出乎意料的是，在多种模型架构和课程进度安排下，我们发现基于难度的训练序列在准确率或响应长度上均未表现出相对于标准随机抽样的稳定优势。这一结论在监督微调（Supervised Fine-Tuning, SFT）和强化学习（Reinforcement Learning, RL）方法中均保持一致。我们的研究表明，在演绎推理的语境下，训练样本的具体排序对于实现组合泛化的作用微乎其微，这对基于课程的后训练的实际效用提出了挑战。

Abstract

Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
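
The two data-ordering strategies compared in the abstract can be sketched as follows. The example items and the operation-count difficulty measure are hypothetical stand-ins for the paper's "reasoning complexity"; the paper's actual benchmarks and curriculum schedules are not reproduced here.

```python
import random


def curriculum_order(examples, difficulty):
    """Easy-to-hard sequencing: sort examples by increasing difficulty."""
    return sorted(examples, key=difficulty)


def random_order(examples, seed=0):
    """Standard baseline: shuffle the examples uniformly at random."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def num_ops(expr):
    """Hypothetical difficulty proxy: number of arithmetic operators."""
    return sum(expr.count(op) for op in "+-*")


# Hypothetical synthetic arithmetic items, listed hardest first.
examples = ["2+2+2+2+2", "6-7+8*9", "3*4+5", "1+2"]
print(curriculum_order(examples, num_ops))
print(random_order(examples))
```

Both functions return the same multiset of training examples; only the order differs, which is the single variable the study isolates.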

Keywords: Curriculum Learning, Post-training, Large Language Models, Deductive Reasoning, Compositional Generalization, Supervised Fine-tuning, Reinforcement Learning, Reasoning Complexity

7. Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection

Authors: Seyed Parsa Neshaei, Richard Lee Davis, Tanja Käser
Journal/Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28596v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

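Rationale: The paper's core topic is the application of large language models (LLMs) to the planning and translation stages of reflective writing, so it is highly relevant to “Large Language Models” (10 points). It deals with reflection depth and structured thinking, which gives it partial relevance to “Chain of Thought”, “System 2 Thinking”, and “Self-Reflection” (5 points each). It uses an LLM as a conversational agent to support the writing process, which gives it partial relevance to “LLM Agents” (5 points). Other keywords such as MoE, quantization, and RAG are not covered by the paper and score 0.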
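The arithmetic behind the score line can be reproduced with a short sketch. It assumes that each row contributes weight × relevance (which matches the table, where weight 1.0 and relevance 10.0 give 10.0), that the per-keyword cap of 15 from the report configuration is applied to each row's contribution, and that the passing threshold follows the configured formula 5.0 + 0.8 × total weight with a total weight of 27.0. The function names are illustrative and are not taken from the report generator.

```python
# Hypothetical helpers; the names are not from the report generator.

def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]], max_per_keyword: float = 15.0) -> float:
    """Sum weight * relevance over keywords, capping each keyword's contribution.

    Applying the cap per row is an assumption about how the configured maximum is used.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# Non-zero rows for this paper as (weight, relevance); zero rows add nothing.
rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0)]

threshold = passing_score(total_weight=27.0)
score = paper_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

With these inputs the five non-zero rows sum to 30.0, which clears the 26.6 threshold.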

!!! tip deepseek-chat TL;DR

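This paper studies how large language models can support students during the planning and translation stages of reflective writing. Experiments show that this support significantly improves reflection depth and structural quality, although the effects diminish in a delayed post-test.

Abstract translation (Chinese)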

反思性写作被认为有助于发展学生的元认知技能，但学习者往往难以进行深度反思，从而限制了学习收获。尽管大型语言模型（LLMs）已被证明能提升写作能力，但其作为对话代理用于反思性写作的效果参差不齐，且主要集中于对反思文本提供反馈，而非在规划与组织阶段提供支持。本文受写作认知过程理论（Cognitive Process Theory of writing，简称CPT）启发，首次提出将LLMs应用于反思性写作的规划与转换步骤。我们介绍了Pensée工具，它通过对话代理搭建结构化反思规划支架，并借助自动提取关键概念来支持转换阶段，以此探究在这些阶段提供显式人工智能支持的效果。我们在一项受控的组间实验（N=93）中对Pensée进行评估，通过控制不同写作阶段的人工智能支持来观察效果。结果显示，当学习者在CPT的规划与转换阶段获得支持时，其反思深度与结构质量显著提高，尽管这些效果在延迟后测中有所减弱。对学习者行为与感知的进一步分析，揭示了符合CPT理论的对话支持如何塑造反思过程与学习体验，从而为理论驱动的大型语言模型在人工智能辅助反思性写作中的应用提供了实证依据。

Abstract

Reflective writing is known to support the development of students’ metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.

Keywords: Large Language Models, Reflective Writing, Cognitive Process Theory, Planning, Translation, Conversational Agent, Metacognitive Skills, AI Support

8. The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residua

Authors: Isaac Llorente-Saguer Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27412v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

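Rationale: The paper's core contribution is a training-free method that detects harmful prompts by analysing the geometry of residual-stream activations in large language models (LLMs), so it is directly and highly relevant to “Large Language Models” and to “Mechanistic Interpretability” (because it probes the model's internal representations). The paper also explicitly evaluates instruction-tuned models, which makes it highly relevant to “Instruction Tuning”. The remaining keywords (MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Context Window, reasoning methods, agents, quantization, acceleration, hallucination mitigation, world models, model merging, in-context learning, AI for Science, and so on) are neither mentioned nor implied in the title or abstract, so they score 0.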

!!! tip deepseek-chat TL;DR

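This paper proposes LatentBiopsy, a training-free method that detects harmful prompts by analysing the angular deviation of residual-stream activations in large language models. It achieves AUROC ≥ 0.937 across six model variants and reveals a geometric dissociation between the representation of harmful intent and the downstream refusal mechanism.

Abstract translation (Chinese)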

本文提出LatentBiopsy，一种无需训练、通过分析大语言模型残差流激活几何形态来检测有害提示的方法。给定200个安全的规范提示，LatentBiopsy在目标层计算其激活向量的主导主成分，并通过新提示的激活向量与该参考方向的径向偏离角$θ$进行表征。异常分数基于规范分布的高斯拟合下$θ$的负对数似然计算，对方向无关的偏离进行对称标记。该方法无需任何有害样本进行训练。
我们在Qwen3.5-0.8B和Qwen2.5-0.5B两个完整模型三元组（基础模型、指令微调模型及通过正交化手术移除拒绝方向的\emph{消除模型}）上进行了评估。在所有六个变体中，LatentBiopsy在有害-规范检测任务上取得AUROC≥0.937，在区分有害提示与良性攻击性提示（XSTest）上达到AUROC=1.000，且单次查询开销低于毫秒级。
我们得出三项实证发现：首先，几何特征在拒绝消除后依然存在——两个消除模型的AUROC最多仅比对应指令微调模型低0.015，这确立了有害意图表征与下游生成拒绝机制之间的几何解耦。其次，有害提示呈现近乎简并的角分布（$σ_θ\approx 0.03$弧度），比规范分布（$σ_θ\approx 0.27$弧度）紧致一个数量级，且该特征在所有对齐阶段（包括消除模型）中均保持。第三，两个模型家族在相同深度呈现相反的环向取向：有害提示在Qwen3.5-0.8B中占据外环，而在Qwen2.5-0.5B中占据内环，这直接启发了我们采用方向无关的评分规则。

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $θ$ from this reference direction. The anomaly score is the negative log-likelihood of $θ$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($σ_θ\approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($σ_θ\approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.

Keywords: Large Language Models, Harmful Prompt Detection, Residual Stream Activations, Training-Free Method, Angular Deviation, Instruction Tuning, Mechanistic Interpretability, Anomaly Detection
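The scoring pipeline described in the abstract (leading principal component of normative activations, deviation angle θ, Gaussian fit, negative log-likelihood) can be illustrated with a minimal sketch. Several details are assumptions, because the abstract does not state them: the vectors are mean-centred before the principal component is computed, the angle is measured on the uncentred activation, and synthetic 4-dimensional vectors stand in for real residual-stream activations. All function names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def leading_direction(vectors, iters=100):
    """Leading principal component of mean-centred vectors, via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centred = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    direction = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Apply the (unnormalised) covariance matrix implicitly: X^T (X u).
        proj = [sum(x[j] * direction[j] for j in range(d)) for x in centred]
        new = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in new))
        direction = [c / norm for c in new]
    return direction


def angle_to(direction, vector):
    """Angle in radians between a unit reference direction and a vector."""
    dot = sum(a * b for a, b in zip(direction, vector))
    norm = math.sqrt(sum(b * b for b in vector))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def fit_gaussian(angles):
    """Mean and standard deviation of the normative angle distribution."""
    mu = sum(angles) / len(angles)
    var = sum((a - mu) ** 2 for a in angles) / len(angles)
    return mu, math.sqrt(var)


def anomaly_score(theta, mu, sigma):
    """Negative log-likelihood of theta under N(mu, sigma^2); symmetric around mu."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (theta - mu) ** 2 / (2 * sigma ** 2)


# Synthetic stand-in for activations of 200 safe prompts (not real model data).
rng = random.Random(0)
normative = [[rng.uniform(1.0, 5.0)] + [rng.gauss(0.0, 0.3) for _ in range(3)]
             for _ in range(200)]
direction = leading_direction(normative)
mu, sigma = fit_gaussian([angle_to(direction, v) for v in normative])

typical = [3.0, 0.3, -0.2, 0.1]   # points roughly along the normative direction
outlier = [0.5, 3.0, 0.0, 0.0]    # points far away from it
for name, vec in (("typical", typical), ("outlier", outlier)):
    theta = angle_to(direction, vec)
    print(name, round(theta, 3), round(anomaly_score(theta, mu, sigma), 2))
```

Because the score depends only on the squared distance of θ from the normative mean, a prompt is flagged whether its angle is unusually large or unusually small, which is the direction-agnostic behaviour the abstract motivates with the opposite ring orientations of the two model families.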

📋 All papers

1. ✅ HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

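HISA proposes a hierarchical indexing method that improves the indexer used for fine-grained sparse attention in large language models. It addresses the O(L²) per-layer indexing bottleneck at long context lengths and, in kernel-level benchmarks, achieves a 2× speedup at 32K context and 4× at 128K while preserving selection accuracy.

Abstract translation (Chinese)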

以DeepSeek稀疏注意力为代表的令牌级稀疏注意力机制，通过一个轻量级索引器为每个查询对每个历史令牌进行评分，并仅对选中的子集计算注意力，从而实现了细粒度的关键令牌选择。尽管下游的稀疏注意力能够高效扩展，但索引器仍需为每个查询扫描整个前缀序列，这引入了每层O($L^2$)的计算瓶颈，随着上下文长度的增长，该瓶颈将变得难以承受。我们提出了HISA（分层索引稀疏注意力），作为一种可直接替换的索引器方案，它将搜索过程从平坦的令牌扫描转变为两阶段分层处理。首先，块级粗粒度过滤器对池化的块代表进行评分，以剪除无关区域。随后，令牌级细粒度处理仅在剩余的候选块内部应用原始索引器。HISA保留了下游稀疏多头注意力运算符所需的精确令牌级Top-K稀疏模式，且无需额外训练。在内核级基准测试中，HISA在32K上下文长度下实现了2$\times$的加速，在128K长度下实现了4$\times$的加速。在Needle-in-a-Haystack和LongBench评测中，我们直接将DeepSeek-V3.2中的索引器替换为HISA，未进行任何微调。HISA在质量上与原始DSA高度吻合，同时显著优于块稀疏基线方法。此外，HISA与原始DSA产生的令牌选择集合的平均交并比大于99%，表明其效率提升几乎未对选择保真度产生影响。

Abstract

Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O($L^2$) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.

Keywords: Sparse Attention, Hierarchical Indexing, Long Context, Inference Acceleration, DeepSeek, Token-level Selection, Attention Mechanism, Computational Efficiency
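The two-stage procedure in the abstract (a block-level coarse filter over pooled block representatives, then token-level refinement inside the surviving blocks) can be sketched as follows. This is an illustration only: dot-product scoring, mean pooling, and the `block_size` and `keep_blocks` parameters are assumptions, since the abstract does not specify the pooling operator or the indexer's scoring function, and the function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flat_topk(query, keys, k):
    """Baseline indexer: score every token in the prefix, keep the k best indices."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return set(ranked[:k])


def hierarchical_topk(query, keys, k, block_size, keep_blocks):
    """Two-stage indexer: coarse block filter, then token-level refinement."""
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    def pooled(block):
        # Mean pooling is an assumption; a real kernel would also cache this per block.
        return [sum(keys[i][j] for i in block) / len(block) for j in range(len(query))]

    # Stage 1: score one pooled representative per block and prune the rest.
    block_rank = sorted(range(len(blocks)),
                        key=lambda b: dot(query, pooled(blocks[b])), reverse=True)
    candidates = [i for b in block_rank[:keep_blocks] for i in blocks[b]]

    # Stage 2: apply the original token-level scorer only inside surviving blocks.
    ranked = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return set(ranked[:k])


def iou(a, b):
    """Intersection over union of two token-index sets."""
    return len(a & b) / len(a | b)


rng = random.Random(0)
dim, n_tokens, k = 8, 256, 16
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [rng.gauss(0, 1) for _ in range(dim)]

flat = flat_topk(query, keys, k)
coarse = hierarchical_topk(query, keys, k, block_size=32, keep_blocks=4)
print("IoU with 4 of 8 blocks kept:", iou(flat, coarse))
```

Per query, the flat scan scores all L tokens, whereas the hierarchical version scores L / `block_size` representatives plus `keep_blocks` × `block_size` tokens. The random keys used here have no locality, so the IoU this toy prints says nothing about the greater-than-99% figure reported for real attention patterns.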

2. ✅ Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

Authors: Iman Sharifi, Alex Zongo, Peng Wei Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28561v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

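This paper studies how supervised fine-tuning and preference-based fine-tuning (using LoRA and GRPO) can adapt a large language model (Qwen-Math-7B) to improve decision accuracy, consistency, and separation performance in cooperative tactical deconfliction for small Unmanned Aerial Systems, a dense, partially observable, heterogeneous multi-agent setting.

Abstract translation (Chinese)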

随着小型无人机系统在低空空域的日益广泛部署，在安全关键约束下实现可靠的战术冲突解脱需求日益增长。战术冲突解脱涉及在密集、部分可观测且异构的多智能体环境中进行短时域决策，必须同时保持协同间隔保障与运行效率。尽管大语言模型展现出强大的推理能力，但其直接应用于空中交通管制仍受限于领域知识基础不足和输出不可预测的不一致性。本文研究将大语言模型作为协同多智能体战术冲突解脱的决策者，采用微调策略使模型输出与人类操作员的启发式规则对齐。我们提出了一种基于BlueSky空中交通模拟器的仿真-语言数据生成流程，该流程能产生符合既定安全实践、规则一致的冲突解脱数据集。采用两种参数高效策略对预训练的Qwen-Math-7B模型进行微调：基于低秩适配的监督微调，以及结合低秩适配与组相对策略优化的偏好微调。在验证数据集和闭环仿真上的实验结果表明，与预训练大语言模型相比，监督式低秩适配微调显著提升了决策准确性、一致性和间隔保持性能，并大幅减少了近距空中碰撞事件。组相对策略优化提供了额外的协同优势，但在与异构智能体策略交互时表现出鲁棒性下降。

Abstract

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs to human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.

Keywords: Large Language Models, Supervised Fine-tuning, LoRA, LLM Agents, Multi-agent Systems, Tactical Deconfliction, Small Unmanned Aerial Systems, Parameter-efficient Fine-tuning
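For readers unfamiliar with GRPO, its central step is to compare each sampled response with the other responses drawn for the same prompt, rather than with a learned value baseline. The sketch below shows only that group-relative normalisation. The reward values are invented for illustration, the use of the population standard deviation is an assumption (implementations differ), and nothing here reflects the paper's actual reward design or training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate deconfliction decisions in one scenario.
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Responses that beat the group average receive positive advantages and are reinforced, and the rest are discouraged. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.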

3. ✅ Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners

Authors: Rohan Pandey, Eric Ye, Michael Li Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28038v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

!!! tip deepseek-chat TL;DR

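This study uses a custom variant of GEPA to optimize prompts for LLMs on scientific reasoning tasks. The analysis finds that performance gains often rely on model-specific heuristics that fail to generalize across systems (termed “local” logic), and argues that prompt optimization can serve as a tool for model interpretability.

Abstract translation (Chinese)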

随着大型语言模型（LLM）在复杂推理任务上展现出日益精密的性能，当前架构已成为前沿模型内部启发式机制的关键代理。描述其涌现的推理能力对于长期的可解释性与安全性至关重要。此外，理解提示如何调节这些过程亦不可或缺，因为自然语言很可能成为与通用人工智能（AGI）系统交互的主要界面。在本研究中，我们采用遗传帕累托（GEPA）的自定义变体，系统性地优化科学推理任务的提示，并分析提示如何影响推理行为。我们探究了GEPA优化提示中固有的结构模式与逻辑启发式，评估了它们的可迁移性与脆弱性。研究结果表明，科学推理能力的提升往往对应着模型特定的启发式策略，这些策略无法在不同系统间泛化，我们称之为“局部”逻辑。通过将提示优化构建为模型可解释性的工具，我们认为，绘制LLM偏好的推理结构图谱，是未来与超人类智能有效协作的重要前提。

Abstract

As Large Language Models (LLMs) achieve increasingly sophisticated performance on complex reasoning tasks, current architectures serve as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety. Furthermore, understanding how prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with AGI systems. In this work, we use a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can affect reasoning behavior. We investigate the structural patterns and logical heuristics inherent in GEPA-optimized prompts, and evaluate their transferability and brittleness. Our findings reveal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call “local” logic. By framing prompt optimization as a tool for model interpretability, we argue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.

Keywords: Large Language Models, Scientific Reasoning, Prompt Optimization, Model Interpretability, Genetic Pareto (GEPA), Reasoning Behavior, Local Logic, AGI Systems

4. ✅ EffiSkill: Agent Skill Based Automated Code Efficiency Optimization

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

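EffiSkill models code optimization patterns as reusable skills, allowing LLM agents to optimize code efficiency without relying on runtime feedback. On EffiBench-X it improves the optimization success rate over the strongest baseline by 3.69 to 12.52 percentage points.

Abstract translation (Chinese)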

代码效率是软件质量的基本维度，然而如何利用大语言模型（LLM）优化程序仍具挑战。现有方法多尝试一次性重写、检索示例或基于提示的搜索，但未能显式提炼可复用的优化知识，这限制了其在具体实例之外的泛化能力。
本文提出EffiSkill，一个面向代码效率优化的框架，旨在为基于LLM的智能体构建可移植的优化工具箱。其核心思想是将反复出现的“慢速到快速”代码转换建模为可复用的智能体技能，这些技能既捕捉具体的转换机制，也涵盖更高层次的优化策略。EffiSkill采用两阶段设计：第一阶段从大规模慢速/快速程序对中挖掘操作符技能（Operator Skill）与元技能（Meta Skill），构建技能库；第二阶段将该技能库应用于未见程序，通过免执行的诊断、技能检索、计划组合与候选代码生成来完成优化，无需运行时反馈。
在EffiBench-X上的实验结果表明，EffiSkill实现了更高的优化成功率，在不同模型与编程语言设置下，相比最强基线提升了3.69至12.52个百分点。这些发现表明，机制层面的技能重用为免执行的代码优化提供了有益基础，且所构建的技能库可作为可复用的资源，服务于更广泛的智能体工作流。

Abstract

Code efficiency is a fundamental aspect of software quality, yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, retrieved exemplars, or prompt-based search, but they do not explicitly distill reusable optimization knowledge, which limits generalization beyond individual instances. In this paper, we present EffiSkill, a framework for code-efficiency optimization that builds a portable optimization toolbox for LLM-based agents. The key idea is to model recurring slow-to-fast transformations as reusable agent skills that capture both concrete transformation mechanisms and higher-level optimization strategies. EffiSkill adopts a two-stage design: Stage I mines Operator and Meta Skills from large-scale slow/fast program pairs to build a skill library; Stage II applies this library to unseen programs through execution-free diagnosis, skill retrieval, plan composition, and candidate generation, without runtime feedback. Results on EffiBench-X show that EffiSkill achieves higher optimization success rates, improving over the strongest baseline by 3.69 to 12.52 percentage points across model and language settings. These findings suggest that mechanism-level skill reuse provides a useful foundation for execution-free code optimization, and that the resulting skill library can serve as a reusable resource for broader agent workflows.

Keywords: code efficiency optimization, large language models, LLM-based agents, agent skills, skill library, execution-free optimization, EffiBench-X, optimization success rate

5. ✅ MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

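This paper introduces MonitorBench, a benchmark for evaluating the monitorability of chain-of-thought (CoT) reasoning in large language models. It finds that CoT monitorability is higher when producing the final output requires structural reasoning through the decision-critical factors, that closed-source models are generally less monitorable, and that monitorability can drop by up to 30% under stress tests.

Abstract translation (Chinese)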

大语言模型（LLM）生成的思维链（CoT）并不总是对其最终输出具有因果决定性。当出现这种不匹配时，思维链便无法忠实反映驱动模型行为的关键决策因素，从而导致思维链可监控性降低的问题。然而，目前仍缺乏一个全面且完全开源的基准来研究思维链可监控性。为填补这一空白，我们提出了MonitorBench，一个用于评估大语言模型中思维链可监控性的系统性基准。MonitorBench提供：（1）一套包含1,514个测试实例的多样化数据集，这些实例涵盖7大类别的19项任务，并精心设计了关键决策因素，用以刻画思维链在何种情况下可用于监控驱动大语言模型行为的因素；（2）两种压力测试设置，用于量化思维链可监控性可能下降的程度。在多个不同能力水平的流行大语言模型上进行的大量实验表明，当生成最终目标回答需要通过对关键决策因素进行结构化推理时，思维链可监控性更高。闭源大语言模型通常表现出较低的可监控性，且可监控性与模型能力之间存在负相关关系。此外，开源和闭源大语言模型在压力测试下均可有意降低可监控性，在一些无需对关键决策因素进行结构化推理的任务中，可监控性下降幅度高达30%。除这些实证发现外，MonitorBench为进一步研究评估未来大语言模型、探索高级压力测试可监控性技术以及开发新型监控方法提供了基础。

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model’s behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.

Keywords: Large Language Models, Chain of Thought, Monitorability, Benchmark, Interpretability, Reasoning, Faithfulness, Stress-test

6. ✅ Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning

Authors: Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27226v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

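This study systematically evaluates curriculum learning in the post-training of large language models on deductive reasoning tasks. It finds that difficulty-based sequencing offers no robust advantage over random sampling in either accuracy or response length, which challenges the practical utility of curriculum learning for such tasks.

Abstract translation (Chinese)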

课程学习（Curriculum Learning, CL）基于“按难度递增顺序学习应有助于提升泛化能力”的直觉，在大语言模型（LLMs）的预训练与后训练阶段均被广泛采用。对于组合推理任务——即复杂问题由基础推理规则构建而成——CL的直觉尤其具有吸引力；然而，CL对此类任务的实际影响在很大程度上仍未得到充分探索。本文通过合成算术与逻辑基准测试（其中难度以推理复杂度而非表层特征为衡量标准），对LLMs后训练中的CL进行了系统的实证研究。出乎意料的是，在多种模型架构和课程进度安排下，我们发现基于难度的训练序列在准确率或响应长度上均未表现出相对于标准随机抽样的稳定优势。这一结论在监督微调（Supervised Fine-Tuning, SFT）和强化学习（Reinforcement Learning, RL）方法中均保持一致。我们的研究表明，在演绎推理的语境下，训练样本的具体排序对于实现组合泛化的作用微乎其微，这对基于课程的后训练的实际效用提出了挑战。

Abstract

Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.

Keywords: Curriculum Learning, Post-training, Large Language Models, Deductive Reasoning, Compositional Generalization, Supervised Fine-tuning, Reinforcement Learning, Reasoning Complexity
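The curriculum learning study above compares difficulty-ordered training data against standard random sampling. The sketch below only illustrates how those two orderings could be built. The `difficulty` field, the toy values, and the function names are assumptions for illustration and do not come from the paper.

```python
import random

def curriculum_order(examples):
    """Easy-to-hard ordering: sort by an assumed numeric difficulty field."""
    return sorted(examples, key=lambda ex: ex["difficulty"])

def random_order(examples, seed=0):
    """Baseline ordering: a seeded shuffle, so the run is repeatable."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy data: difficulty stands in for reasoning complexity (hypothetical values).
examples = [{"id": i, "difficulty": d} for i, d in enumerate([3, 1, 4, 1, 5, 2])]
easy_to_hard = curriculum_order(examples)
baseline = random_order(examples)
```

Both orderings contain exactly the same examples, so any difference in downstream results would come from sequencing alone.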

7. ✅ Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection

Authors: Seyed Parsa Neshaei, Richard Lee Davis, Tanja Käser Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28596v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
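The total in the score table above is the sum of weight × relevance over all 27 keywords, compared with the passing score from the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below reproduces that arithmetic for this paper. The function names are illustrative, and treating "pass" as greater than or equal to the threshold is an assumption.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight

def paper_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)

# Non-zero rows of the table above, plus 22 zero rows for the other keywords.
nonzero = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0)]
rows = nonzero + [(1.0, 0.0)] * 22

total = paper_score(rows)
threshold = passing_score(total_weight=27.0)
passed = total >= threshold  # assumption: ties count as a pass
```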

!!! tip deepseek-chat TL;DR

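This paper studies how large language models can support students during the planning and translation stages of reflective writing. In the experiment, this support significantly improved reflection depth and structural quality, although the effects weakened in a delayed post-test.

Abstract (Chinese translation)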

反思性写作被认为有助于发展学生的元认知技能，但学习者往往难以进行深度反思，从而限制了学习收获。尽管大型语言模型（LLMs）已被证明能提升写作能力，但其作为对话代理用于反思性写作的效果参差不齐，且主要集中于对反思文本提供反馈，而非在规划与组织阶段提供支持。本文受写作认知过程理论（Cognitive Process Theory of writing，简称CPT）启发，首次提出将LLMs应用于反思性写作的规划与转换步骤。我们介绍了Pensée工具，它通过对话代理搭建结构化反思规划支架，并借助自动提取关键概念来支持转换阶段，以此探究在这些阶段提供显式人工智能支持的效果。我们在一项受控的组间实验（N=93）中对Pensée进行评估，通过控制不同写作阶段的人工智能支持来观察效果。结果显示，当学习者在CPT的规划与转换阶段获得支持时，其反思深度与结构质量显著提高，尽管这些效果在延迟后测中有所减弱。对学习者行为与感知的进一步分析，揭示了符合CPT理论的对话支持如何塑造反思过程与学习体验，从而为理论驱动的大型语言模型在人工智能辅助反思性写作中的应用提供了实证依据。

Abstract

Reflective writing is known to support the development of students’ metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.

Keywords: Large Language Models, Reflective Writing, Cognitive Process Theory, Planning, Translation, Conversational Agent, Metacognitive Skills, AI Support

8. ✅ The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Authors: Isaac Llorente-Saguer Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27412v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

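This paper proposes LatentBiopsy, a training-free method that detects harmful prompts by analysing the angular deviation of residual-stream activations in large language models. It achieves high detection performance across several model variants and reveals a geometric dissociation between the representation of harmful intent and the downstream refusal mechanism.

Abstract (Chinese translation)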

本文提出LatentBiopsy，一种无需训练、通过分析大语言模型残差流激活几何形态来检测有害提示的方法。给定200个安全的规范提示，LatentBiopsy在目标层计算其激活向量的主导主成分，并通过新提示的激活向量与该参考方向的径向偏离角$θ$进行表征。异常分数基于规范分布的高斯拟合下$θ$的负对数似然计算，对方向无关的偏离进行对称标记。该方法无需任何有害样本进行训练。
我们在Qwen3.5-0.8B和Qwen2.5-0.5B两个完整模型三元组（基础模型、指令微调模型及通过正交化手术移除拒绝方向的消除模型）上进行了评估。在所有六个变体中，LatentBiopsy在有害-规范检测任务上取得AUROC≥0.937，在区分有害提示与良性攻击性提示（XSTest）上达到AUROC=1.000，且单次查询开销低于毫秒级。
我们得出三项实证发现：首先，几何特征在拒绝消除后依然存在——两个消除模型的AUROC最多仅比对应指令微调模型低0.015，这确立了有害意图表征与下游生成拒绝机制之间的几何解耦。其次，有害提示呈现近乎简并的角分布（$σ_θ\approx 0.03$弧度），比规范分布（$σ_θ\approx 0.27$弧度）紧致一个数量级，且该特征在所有对齐阶段（包括消除模型）中均保持。第三，两个模型家族在相同深度呈现相反的环向取向：有害提示在Qwen3.5-0.8B中占据外环，而在Qwen2.5-0.5B中占据内环，这直接启发了我们采用方向无关的评分规则。

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $θ$ from this reference direction. The anomaly score is the negative log-likelihood of $θ$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($σ_θ\approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($σ_θ\approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
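The scoring rule described in the abstract (leading principal component of the safe activations, deviation angle θ, Gaussian negative log-likelihood) can be sketched in plain Python as follows. This is a toy illustration and not the authors' implementation. Mean-centering before the principal component step, the power iteration, the 3-dimensional synthetic "activations", and all function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def leading_direction(vectors, iters=100):
    """Leading principal component via power iteration on mean-centered data."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    direction = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Multiply by the (unnormalised) covariance without forming the matrix.
        proj = [dot(row, direction) for row in centered]
        updated = [sum(p * row[i] for p, row in zip(proj, centered)) for i in range(dim)]
        scale = norm(updated)
        direction = [x / scale for x in updated]
    return direction

def angle_to(direction, vector):
    """Angle in radians between the reference direction and an activation."""
    cosine = dot(direction, vector) / (norm(direction) * norm(vector))
    return math.acos(max(-1.0, min(1.0, cosine)))

def fit_reference(safe_activations):
    """Reference direction plus Gaussian (mu, sigma) of the safe angles."""
    direction = leading_direction(safe_activations)
    thetas = [angle_to(direction, v) for v in safe_activations]
    mu = sum(thetas) / len(thetas)
    sigma = math.sqrt(sum((t - mu) ** 2 for t in thetas) / len(thetas))
    return direction, mu, sigma

def anomaly_score(reference, activation):
    """Negative log-likelihood of the angle under the fitted Gaussian."""
    direction, mu, sigma = reference
    theta = angle_to(direction, activation)
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (theta - mu) ** 2 / (2 * sigma ** 2)

# Synthetic stand-ins for safe-prompt activations (not real model data).
safe = [[10.0 + 0.5 * k, math.sin(k), math.cos(k)] for k in range(20)]
reference = fit_reference(safe)
typical_score = anomaly_score(reference, [12.0, 0.5, 0.5])
unusual_score = anomaly_score(reference, [1.0, 8.0, 0.0])
```

Because the score depends on the squared distance of θ from the safe mean, a deviation in either direction raises it, which matches the direction-agnostic behaviour the abstract describes.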

9. ❌ AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28696v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

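Score rationale: The paper proposes the AdaptToken framework, which targets the challenges that multi-modal large language models (MLLMs) face in long video understanding. Its core innovation is using the model's self-uncertainty for global token selection and early stopping. It is therefore highly relevant to “Large Language Models” and “Context Window Extension” (10 points each), because it directly addresses the long-context limits of MLLMs. It is partly related to “Speculative Decoding” (5 points): AdaptToken-Lite cuts inference time through early stopping, which falls under inference acceleration. The remaining keywords, such as MoE, SLMs, training methods, alignment, RAG, reasoning techniques, and agents, are neither mentioned in the abstract nor relevant to it, so they score 0.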

!!! tip deepseek-chat TL;DR

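This paper addresses the high memory cost and context-length limits that multi-modal large language models face in long video understanding. It proposes AdaptToken, a training-free framework that uses entropy-based adaptive token selection and early stopping. AdaptToken improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B), and AdaptToken-Lite cuts inference time by about half with comparable performance.

Abstract (Chinese translation)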

长视频理解对多模态大语言模型（MLLMs）而言仍具挑战性，主要受限于高内存成本和上下文长度限制。先前的研究通过评分和选择短视频片段内的帧/令牌来缓解此问题，但它们缺乏一种原则性机制来（i）比较远距离视频片段之间的相关性，以及（ii）在收集到足够证据后停止处理。我们提出了AdaptToken，一种无需训练的框架，它将MLLM的自我不确定性转化为长视频令牌选择的全局控制信号。AdaptToken将视频分割为多个组，提取跨模态注意力以对每组内的令牌进行排序，并利用模型的响应熵来估计每组与提示的相关性。该熵信号支持跨组的全局令牌预算分配，并进一步实现早期停止（AdaptToken-Lite），即当模型变得足够确定时跳过剩余组处理。在四个长视频基准测试（VideoMME、LongVideoBench、LVBench和MLVU）和多种基础MLLM（7B-72B）上的实验表明，AdaptToken持续提升了准确性（例如，在Qwen2.5-VL 7B模型上平均提升+6.7分），并能持续受益于极长输入（高达10K帧），而AdaptToken-Lite以可比的性能将推理时间减少约一半。项目页面：https://haozheqi.github.io/adapt-token

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM’s self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model’s response entropy to estimate each group’s prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token

Keywords: Multi-modal Large Language Models, MLLMs, long video understanding, adaptive token selection, entropy-based, inference acceleration, context window, early stopping
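The AdaptToken abstract describes three moving parts: ranking tokens within a group by attention, using response entropy to split a global token budget across groups, and stopping early once the model is certain enough. The sketch below shows one way these pieces could fit together. The exponential weighting, the stop threshold, the in-order group processing, and every name here are assumptions, since the abstract does not give the exact rules.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a response distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies, total_budget, temperature=1.0):
    """Assumed rule: lower entropy means higher relevance and a larger budget share."""
    weights = [math.exp(-h / temperature) for h in entropies]
    total = sum(weights)
    return [int(total_budget * w / total) for w in weights]

def select_tokens(attention_scores, k):
    """Keep the k highest-attention token indices, returned in original order."""
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:k])

def process_groups(group_answer_probs, stop_entropy):
    """Visit groups in order and stop once entropy falls to the threshold."""
    entropies = []
    for probs in group_answer_probs:
        h = entropy(probs)
        entropies.append(h)
        if h <= stop_entropy:  # model is "sufficiently certain" (assumed criterion)
            break
    return entropies

# Toy response distributions for four video groups (made-up numbers).
group_answer_probs = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
]
entropies = process_groups(group_answer_probs, stop_entropy=0.2)
budgets = allocate_budget(entropies, total_budget=100)
kept = select_tokens([0.1, 0.9, 0.3, 0.7], k=2)
```

In this toy run the third group is confident enough to trigger the stop, so the fourth group is never processed, and the most confident group receives the largest share of the budget.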

10. ❌ Q-DIVER: Integrated Quantum Transfer Learning and Differentiable Quantum Architecture Search with EEG Data

Authors: Junghoon Justin Park, Yeonghyeon Park, Jiook Cha Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28122v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

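Score rationale: The core of Q-DIVER is the combination of quantum machine learning (QML) and deep learning for biomedical signal processing (EEG), which falls under AI for Science (bioinformatics), so that keyword is highly relevant (10 points). The paper uses a pretrained model (DIVER-1) and fine-tuning, which gives it some connection to “Pre-training/Domain Adaptation” and “Post-training/SFT” (5 points each). Its quantum classifier is parameter-efficient (about 50× fewer task-specific head parameters), which relates to the “PEFT/Parameter-efficient Fine-tuning” concept (5 points). The other keywords mainly target large language models (LLMs) and their specific techniques (e.g., MoE, RLHF, RAG), or topics such as reasoning, agents, and compression. This paper focuses on quantum circuits and EEG processing and does not cover them, so they score 0.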

!!! tip deepseek-chat TL;DR

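This paper proposes Q-DIVER, a hybrid framework that combines a pretrained EEG encoder with differentiable quantum architecture search for EEG signal classification. It reaches performance comparable to a classical multi-layer perceptron while using about 50× fewer task-specific head parameters, supporting the parameter efficiency of quantum transfer learning for biological signal processing.

Abstract (Chinese translation)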

将量子电路整合到深度学习流程中仍面临启发式设计局限性的挑战。我们提出Q-DIVER混合框架，将大规模预训练的脑电编码器（DIVER-1）与可微分量子分类器相结合。区别于固定拟设方案，我们采用可微分量子架构搜索技术，在端到端微调过程中自主发现任务最优的电路拓扑结构。在PhysioNet运动想象数据集上，我们的量子分类器取得了与经典多层感知器相当的预测性能（测试F1分数：63.49%），同时使用的任务特定头部参数减少约50倍（2.10M对比105.02M）。这些结果验证了量子迁移学习可作为高维生物信号处理中一种参数高效的策略。

Abstract

Integrating quantum circuits into deep learning pipelines remains challenging due to heuristic design limitations. We propose Q-DIVER, a hybrid framework combining a large-scale pretrained EEG encoder (DIVER-1) with a differentiable quantum classifier. Unlike fixed-ansatz approaches, we employ Differentiable Quantum Architecture Search to autonomously discover task-optimal circuit topologies during end-to-end fine-tuning. On the PhysioNet Motor Imagery dataset, our quantum classifier achieves predictive performance comparable to classical multi-layer perceptrons (Test F1: 63.49%) while using approximately 50× fewer task-specific head parameters (2.10M vs. 105.02M). These results validate quantum transfer learning as a parameter-efficient strategy for high-dimensional biological signal processing.

Keywords: Quantum Machine Learning, Differentiable Quantum Architecture Search, EEG Signal Processing, Parameter-efficient Fine-tuning, Transfer Learning, Bioinformatics, Quantum Transfer Learning, Motor Imagery Classification

11. ❌ LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

Authors: Alexandre Cristovão Maiorano Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27355v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

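Score rationale: The paper's core contribution is a framework for evaluating the readiness of LLM and RAG applications, so it is highly relevant to “Large Language Models” and “Retrieval-Augmented Generation” (10 points each). The paper evaluates groundedness and faithfulness, which is partly related to “Hallucination Mitigation” (5 points). The other keywords, covering model architecture, training methods, inference optimization, and scientific applications, are not addressed in the paper, so they score 0.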

!!! tip deepseek-chat TL;DR

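This paper proposes a readiness evaluation framework for LLM and RAG applications that combines automated benchmarks, observability, and CI quality gates to produce scenario-weighted readiness scores. Evaluated on ticket-routing workflows and BEIR benchmark tasks, the framework separates models operationally and blocks risky releases.

Abstract (Chinese translation)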

本文提出一种面向大语言模型（LLM）与检索增强生成（RAG）应用的就绪度评估框架，该框架将评估转化为部署决策工作流。该系统在最小化API合约下，整合了自动化基准测试、OpenTelemetry可观测性以及持续集成（CI）质量门禁，进而将工作流成功率、策略合规性、事实依据性、检索命中率、成本及p95延迟等指标，聚合为基于场景加权的就绪度评分，并辅以帕累托前沿分析。我们在工单路由工作流以及BEIR事实依据任务（SciFact和FiQA）上对该框架进行了全面评估，实现了对Azure矩阵的全覆盖（涵盖数据集、场景、检索深度、随机种子和模型，共162/162个有效单元）。结果表明，就绪度并非单一指标：在FiQA数据集上，采用sla-first策略且检索深度k=5时，gpt-4.1-mini在就绪度与忠实度上领先，而gpt-5.2则需承担显著的延迟代价；在SciFact任务中，各模型质量接近但在运行层面仍可区分。工单路由的回归门禁能持续拦截不安全的提示词变体，证明该框架可有效阻止高风险版本发布，而非仅提供离线评分。最终，我们构建了一个可复现、基于运行实践的决策框架，用于判定LLM或RAG系统是否达到发布就绪状态。

Abstract

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.

Keywords: LLM applications, RAG applications, readiness evaluation, deployment decision workflow, automated benchmarks, observability, CI quality gates, groundedness
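The readiness harness above aggregates several metrics into a scenario-weighted score and uses that score as a CI gate. The sketch below shows one plausible shape for this aggregation. The metric names follow the abstract and `sla_first` echoes its "sla-first" scenario, but the weights, budgets, threshold, metric values, and normalisation rule are all made-up assumptions.

```python
def normalize_lower_is_better(value, budget):
    """Map a cost-like metric to [0, 1], where 1 is best and 0 means at or over budget."""
    return max(0.0, 1.0 - value / budget)

def readiness(metrics, weights):
    """Weighted average of metrics that are already scaled to [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight

def ci_gate(score, threshold):
    """Release gate: pass only if the readiness score reaches the threshold."""
    return score >= threshold

# Hypothetical measurements for one model in one scenario.
metrics = {
    "workflow_success": 0.92,
    "policy_compliance": 1.0,
    "groundedness": 0.85,
    "retrieval_hit_rate": 0.80,
    "cost": normalize_lower_is_better(0.002, budget=0.01),
    "p95_latency": normalize_lower_is_better(1.5, budget=3.0),
}

# Hypothetical weights for a latency-sensitive scenario.
sla_first = {
    "workflow_success": 2.0,
    "policy_compliance": 2.0,
    "groundedness": 1.0,
    "retrieval_hit_rate": 1.0,
    "cost": 1.0,
    "p95_latency": 3.0,
}

score = readiness(metrics, sla_first)
release_ok = ci_gate(score, threshold=0.75)
```

Changing the weight dictionary changes the ranking of the same measurements, which is one way to read the abstract's point that readiness is not a single metric.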

12. ❌ Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG

Authors: Davide Di Gioia Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28444v1

Score: 23.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

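Score rationale: The paper's core topic is an evidence-selection algorithm for RAG systems, which is highly relevant to “Retrieval-Augmented Generation” (10 points). It concerns an LLM application setting (8 points), and it contrasts itself with agentic frameworks such as ReAct, which gives it some connection to “LLM Agents” (5 points). The other keywords, such as MoE, SLMs, training methods, and inference acceleration, are not covered and score 0.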

!!! tip deepseek-chat TL;DR

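RAG systems that rely only on relevance-based retrieval fall short when evidence conflicts or the query is ambiguous. This paper proposes ECR, an uncertainty-driven evidence-selection algorithm based on entropy minimization, shifting the paradigm from retrieving the most relevant evidence to retrieving the most discriminative evidence.

Abstract (Chinese translation)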

当前基于检索增强生成（RAG）的系统主要依赖基于相关性的密集检索，通过顺序获取文档以最大化其与查询的语义相似性。然而，在知识密集型且存在证据冲突或查询本身具有根本模糊性的现实场景中，仅依靠相关性不足以解决认知不确定性。本文提出一种新颖的推理时算法——熵化主张消解（Entropic Claim Resolution, ECR），其将RAG推理重新定义为在相互竞争的语义答案假设上进行熵最小化的过程。与行动驱动的智能体框架（如ReAct）或固定流程的RAG架构不同，ECR通过最大化期望熵减（Expected Entropy Reduction, EER）——一种基于信息价值的决策理论准则——来顺序选择原子证据主张。当系统达到数学定义的认知充分状态（H ≤ ε，且满足认知一致性约束）时，该过程动态终止。我们将ECR集成至一个生产级多策略检索流程（CSGR++）中，并分析其理论特性。该框架为不确定性感知的证据选择提供了严格的基础，将检索范式从“检索最相关的内容”转向“检索最具区分度的内容”。

Abstract

Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H <= epsilon, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.

Keywords: Retrieval-Augmented Generation, RAG, evidence selection, entropy minimization, uncertainty-aware, epistemic sufficiency, decision-theoretic, multi-strategy retrieval
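The selection loop in the ECR abstract (pick the claim with the largest Expected Entropy Reduction, update beliefs, stop when H ≤ ε) can be illustrated with a small Bayesian toy. This is a sketch under simplifying assumptions and not the paper's algorithm: claims are binary (verified or not), each claim comes with an assumed likelihood per hypothesis, and the epistemic coherence constraint is not modelled. All names and numbers are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution over answer hypotheses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, outcome):
    """Bayes update, where likelihood[h] = P(claim is verified | hypothesis h)."""
    unnorm = {h: prior[h] * (likelihood[h] if outcome else 1 - likelihood[h])
              for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def expected_entropy_reduction(prior, likelihood):
    """Current entropy minus the expected entropy after checking the claim."""
    p_true = sum(prior[h] * likelihood[h] for h in prior)
    expected = 0.0
    for outcome, p_out in ((True, p_true), (False, 1 - p_true)):
        if p_out > 0:
            expected += p_out * entropy(posterior(prior, likelihood, outcome))
    return entropy(prior) - expected

def resolve(prior, claims, observe, epsilon):
    """Greedily check claims until entropy is at most epsilon or claims run out."""
    belief = dict(prior)
    remaining = dict(claims)
    used = []
    while remaining and entropy(belief) > epsilon:
        best = max(remaining,
                   key=lambda c: expected_entropy_reduction(belief, remaining[c]))
        belief = posterior(belief, remaining.pop(best), observe(best))
        used.append(best)
    return belief, used

prior = {"A": 0.5, "B": 0.5}
claims = {
    # Equally likely under both hypotheses: relevant but uninformative.
    "c_relevant": {"A": 0.9, "B": 0.9},
    # Likely under A and unlikely under B: discriminative.
    "c_discriminative": {"A": 0.95, "B": 0.05},
}
belief, used = resolve(prior, claims, observe=lambda claim: True, epsilon=0.3)
```

The claim that is equally likely under both hypotheses has zero expected entropy reduction, so the loop checks the discriminative claim first and then stops, which mirrors the abstract's contrast between "most relevant" and "most discriminative".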

13. ❌ Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries

Authors: Jon-Paul Cacioli Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28258v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

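Score rationale: The paper studies representational geometry in the hidden states of large language models (LLMs), specifically categorical perception when processing numerals. It is highly relevant to “Large Language Models” (10 points), and because it explains internal model mechanisms it is also highly relevant to “Mechanistic Interpretability” (10 points). The other keywords, such as MoE, SLMs, training methods, reasoning techniques, and application domains, are not covered in the paper, so they score 0.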

!!! tip deepseek-chat TL;DR

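This study finds that when large language models process Arabic numerals, their hidden-state representations show geometric warping at digit-count boundaries (such as 10 and 100), resembling human categorical perception, and that this warping is dissociated from whether the model can explicitly report the category distinction.

Abstract (Chinese translation)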

范畴感知（Categorical perception, CP）——即类别边界处辨别能力的增强——是感知心理学中研究最广泛的现象之一。本文报告了在处理阿拉伯数字的大型语言模型（LLMs）的隐藏状态表征中，出现了类似的几何扭曲现象。研究通过对来自五个架构家族的六个模型进行表征相似性分析发现，在所有被测试模型的100%主要层级中，一个CP叠加模型（对数距离加上边界增强）比纯粹的连续模型更好地拟合了表征几何结构。这种效应特异地出现在结构定义的边界处（如数字位数在10和100处的转换），在非边界的控制位置则不存在，并且在温度领域（其语言类别如热/冷缺乏分词不连续性）中也不存在。研究发现了两种性质不同的特征模式：“经典CP”（Gemma、Qwen模型），即模型既能进行显式分类，也表现出几何扭曲；以及“结构CP”（Llama、Mistral、Phi模型），即模型在边界处出现几何扭曲，但无法报告类别区分。这种分离现象在不同边界之间是稳定的，并且是模型架构本身的属性，而非刺激的特性。研究结果表明，结构性的输入格式不连续性足以在大型语言模型中产生范畴感知的几何结构，而不依赖于显式的语义类别知识。

Abstract

Categorical perception (CP) – enhanced discriminability at category boundaries – is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: “classic CP” (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and “structural CP” (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.

Keywords: Categorical Perception, Large Language Models, Hidden States, Representational Geometry, Digit-Count Boundaries, Structural Warping, Mechanistic Interpretability, LLM Representations
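The abstract describes the CP-additive model as "log-distance plus a boundary boost". The sketch below is one possible reading of that description, not the paper's implementation: the function names, the `boost` value, and the choice to apply a single fixed boost whenever two numerals differ in digit count are all assumptions made for illustration.

```python
import math


def digit_count(n: int) -> int:
    """Number of decimal digits in a positive integer."""
    return len(str(n))


def log_distance(a: int, b: int) -> float:
    """Continuous model: distance between two numerals on a log number line."""
    return abs(math.log(a) - math.log(b))


def cp_additive(a: int, b: int, boost: float = 0.5) -> float:
    """CP-additive model: log-distance plus a fixed boost when the pair
    straddles a digit-count boundary. `boost` is a hypothetical value."""
    crosses_boundary = digit_count(a) != digit_count(b)
    return log_distance(a, b) + (boost if crosses_boundary else 0.0)


within = cp_additive(8, 9)    # same digit count, so no boost is added
across = cp_additive(9, 10)   # crosses the boundary at 10, so the boost applies
```

Under this reading, 9 and 10 are predicted to be further apart than 8 and 9 even though their log-distance is smaller, which is the kind of boundary warping the abstract reports.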

14. ❌ HandX: Scaling Bimanual Motion and Interaction Generation

Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28766v1

Score: 16.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

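Scoring rationale: HandX studies bimanual motion and interaction generation, which belongs to computer vision and motion synthesis. Its core contribution is a unified framework for data, annotation, and evaluation, and it uses large language models (LLMs) to annotate motion features semantically (the abstract says it “leverages reasoning from large language models to produce fine-grained, semantically rich descriptions”), so it is rated highly relevant to “Large Language Models” (8 points). The paper also explicitly discusses how scale and data quality relate (“We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion.”), which bears directly on “Scaling Laws” AND “Data Quality” (8 points). It does not cover the specific LLM techniques named by the other keywords (such as MoE, SFT, RLHF, or PEFT), reasoning methods (such as CoT or MCTS), agent systems, efficiency optimizations (such as quantization or speculative decoding), or applications in specific scientific fields (such as bioinformatics), so those keywords score 0.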
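The score table above can be checked with a few lines of arithmetic. This sketch assumes that a paper's total is the sum of weight × relevance over all keywords, and it uses the pass-mark formula stated in the report's configuration (5.0 + 0.8 × total weight). The function names are hypothetical and are not taken from the report generator; whether the pass comparison is strict or inclusive is also an assumption.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report's configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed aggregation)."""
    return sum(weight * relevance for weight, relevance in rows)


# HandX: two keywords rated 8.0, the other 25 rated 0.0, all with weight 1.0.
handx_rows = [(1.0, 8.0), (1.0, 8.0)] + [(1.0, 0.0)] * 25

threshold = pass_mark(total_weight=27.0)   # 27 keywords, each with weight 1.0
score = paper_score(handx_rows)
passed = score >= threshold                # inclusive comparison is an assumption
```

With these inputs the total is 16.0 against a pass mark of 26.6, which matches the ❌ shown for this paper.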

!!! tip deepseek-chat TL;DR

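Existing methods for generating realistic bimanual interaction lack fine-grained finger dynamics and coordination. This paper proposes HandX, which consolidates high-quality datasets, uses large language models for semantic annotation, and benchmarks diffusion and autoregressive models. It demonstrates high-quality dexterous motion generation and observes scaling trends: larger models trained on larger, higher-quality datasets perform better.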

Abstract (Chinese translation)

人体运动合成技术发展迅速，但真实的手部运动与双手交互研究仍显不足。全身模型常忽略驱动灵巧行为、手指关节活动、接触时机及双手协调的精细线索，现有资源也缺乏能捕捉细微手指动态与协作的高保真双手运动序列。为填补这一空白，我们提出HandX——一个涵盖数据、标注与评估的统一基础框架。我们整合并筛选现有数据集以提升质量，同时采集了新的动作捕捉数据集，重点关注缺乏代表性的双手交互行为，并记录精细的手指动态。为实现可扩展的标注，我们引入解耦策略：先提取代表性运动特征（如接触事件、手指屈曲），再利用大语言模型的推理能力生成与这些特征对齐的、语义丰富的细粒度描述。基于所得数据与标注，我们以多种条件模式对扩散模型和自回归模型进行基准测试。实验证明了高质量灵巧动作的生成能力，并通过我们新提出的手部专项评估指标予以验证。我们进一步观察到明显的规模化趋势：在更大规模、更高质量数据集上训练的模型能生成语义更连贯的双手运动。本数据集已公开，以支持未来研究。

Abstract

Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.

Keywords: bimanual motion generation, hand motion synthesis, large language models for annotation, scaling laws, motion-capture dataset, diffusion models, autoregressive models, fine-grained finger dynamics

15. ❌ SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28731v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

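Scoring rationale: The core of SAGAI-MID is the use of large language models (LLMs) as runtime middleware to resolve schema mismatches in distributed systems. It is therefore highly relevant to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), because LLMs are the system's central component, used for semantic analysis and code generation. The paper does not go into the technical details behind the other keywords, such as MoE, SLMs, training methods (pre-training, fine-tuning, alignment), inference optimization (attention, decoding), agent systems, model compression, or AI-for-science applications, so those keywords score 0.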

!!! tip deepseek-chat TL;DR

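This paper proposes SAGAI-MID, a generative-AI-driven middleware that uses large language models to detect and resolve schema mismatches between heterogeneous services in distributed systems at runtime. Across its interoperability scenarios, the best configuration reaches 0.90 pass@1 accuracy.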

Abstract (Chinese translation)

现代分布式系统集成了异构服务、不同模式版本的REST API、GraphQL端点以及采用私有数据格式的物联网设备，这些组件长期存在模式不匹配问题。传统的静态适配器需要为每对模式手动编写代码，无法在运行时处理新型组合。本文提出SAGAI-MID——一个基于FastAPI的中间件，其利用大语言模型在运行时动态检测并解决模式不匹配问题。该系统采用五层处理流程：混合检测（结构差异分析结合LLM语义分析）、双重解决策略（基于单次请求的LLM即时转换与LLM生成的可复用适配器代码）以及三层防护机制（验证、集成投票、基于规则的降级方案）。我们通过Bass等人提出的互操作性策略框架来构建系统架构，将这些策略从设计期构件转化为运行时能力。我们在10种互操作性场景中对SAGAI-MID进行评估，场景涵盖REST版本迁移、物联网到分析系统的桥接以及GraphQL协议转换，测试涉及来自两家供应商的六种大语言模型。最佳配置方案达到0.90的pass@1准确率。CODEGEN策略始终优于DIRECT策略（平均pass@1准确率0.83对0.77），而不同模型的成本差异超过30倍且无相应准确率提升；准确率最高的模型同时成本最低。最后我们探讨了将大语言模型作为运行时架构组件对软件架构师带来的启示。

Abstract

Modern distributed systems integrate heterogeneous services, REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.’s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.

Keywords: Generative AI, Middleware, Large Language Models, Runtime Interoperability, Schema Mismatch, Dynamic Resolution, Software Architecture, Distributed Systems
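The abstract lists a three-tier safeguard stack: validation, ensemble voting, and rule-based fallback. The sketch below shows how such tiers could be chained. It is an illustration under stated assumptions, not SAGAI-MID's code: every function name, the strict-majority voting rule, and the field-rename fallback are hypothetical.

```python
import json
from collections import Counter
from typing import Optional


def is_valid(payload: dict, required: dict[str, type]) -> bool:
    """Tier 1: every required field is present with the expected type."""
    return all(isinstance(payload.get(key), kind) for key, kind in required.items())


def majority_vote(candidates: list[dict]) -> Optional[dict]:
    """Tier 2: return the candidate produced by a strict majority, if any."""
    if not candidates:
        return None
    counts = Counter(json.dumps(c, sort_keys=True) for c in candidates)
    winner, votes = counts.most_common(1)[0]
    return json.loads(winner) if votes * 2 > len(candidates) else None


def rule_based_fallback(source: dict, renames: dict[str, str]) -> dict:
    """Tier 3: deterministic field renaming when the LLM outputs are unusable."""
    return {renames.get(key, key): value for key, value in source.items()}


def resolve(source: dict, llm_outputs: list[dict],
            required: dict[str, type], renames: dict[str, str]) -> dict:
    """Run the three tiers in order and return the first usable result."""
    valid = [c for c in llm_outputs if is_valid(c, required)]
    voted = majority_vote(valid)
    return voted if voted is not None else rule_based_fallback(source, renames)


# Hypothetical IoT payload and target schema.
source = {"temp_c": 21.5, "id": "s1"}
required = {"temperature": float, "sensor_id": str}
renames = {"temp_c": "temperature", "id": "sensor_id"}
llm_outputs = [
    {"temperature": 21.5, "sensor_id": "s1"},
    {"temperature": 21.5, "sensor_id": "s1"},
    {"temp": 21.5},  # fails validation and is dropped before voting
]
result = resolve(source, llm_outputs, required, renames)
```

If every LLM output fails validation, or no candidate wins a strict majority, `resolve` falls through to the deterministic rename mapping.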

16. ❌ Membership Inference Attacks against Large Audio Language Models

Authors: Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28378v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

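Scoring rationale: This paper evaluates membership inference attacks on Large Audio Language Models (LALMs), a specific subfield of large language models (LLMs). Its focus is the security and privacy of LALMs, so it is highly relevant to “Large Language Models” (10 points). The other keywords concern model architecture, training techniques, inference optimization, alignment methods, and application domains. This paper studies model security auditing and has no direct connection to those topics, so they score 0.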

!!! tip deepseek-chat TL;DR

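This paper presents the first systematic evaluation of membership inference attacks on large audio language models. It finds that non-semantic information in audio causes train/test distribution shift, which produces spurious MIA performance, and it establishes a reliable evaluation standard free of distribution-shift confounds.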

Abstract (Chinese translation)

本文首次对大型音频语言模型（Large Audio Language Models, LALMs）进行了系统的成员推理攻击（Membership Inference Attack, MIA）评估。由于音频编码了非语义信息，其会导致严重的训练与测试分布偏移，并可能引发虚假的MIA性能表现。通过使用一种基于文本、频谱和韵律特征的多模态盲基线方法，我们证明即使不进行模型推理，常见的语音数据集也呈现出近乎完美的训练/测试可分离性（AUC接近1.0），且标准MIA分数与这些盲声学伪影高度相关（相关性大于0.7）。利用此盲基线，我们发现分布匹配的数据集能够实现可靠的MIA评估，而无需受分布偏移的干扰。我们在这些数据集上对多种MIA方法进行了基准测试，并进行了模态解耦实验。结果表明，LALM的记忆是跨模态的，仅源于将说话者的声音身份与其文本内容进行绑定。这些发现为超越虚假相关性、审计LALMs建立了一项原则性标准。

Abstract

We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train and test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC approximately 1.0) even without model inference, and the standard MIA scores strongly correlate with these blind acoustic artifacts (correlation greater than 0.7). Using this blind baseline, we identify that distribution-matched datasets enable reliable MIA evaluation without distribution shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker’s vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.

Keywords: Membership Inference Attack, Large Audio Language Models, train-test distribution shift, multi-modal baseline, acoustic artifacts, cross-modal memorization, model auditing, privacy evaluation
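The abstract's key diagnostic is a "blind" baseline: if train and test clips can be separated from data features alone, with no model queried, a high MIA AUC says little about memorization. The sketch below computes AUC as a pairwise win rate and applies it to a hypothetical blind feature (clip duration). The feature and the numbers are invented for illustration; the paper's baseline uses textual, spectral, and prosodic features.

```python
def auc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """AUC as the probability that a random member outscores a random
    non-member (ties count one half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical blind feature: clip duration in seconds, no model involved.
train_durations = [4.1, 4.3, 3.9, 4.0]    # "members"
test_durations = [9.8, 10.2, 11.0, 9.5]   # "non-members"

# Score = negative duration, so shorter clips look more member-like.
blind_auc = auc([-d for d in train_durations], [-d for d in test_durations])

# A distribution-matched split should leave the blind baseline near chance.
matched_auc = auc([1.0, 3.0], [2.0, 2.0])
```

Here the shifted split is perfectly separable without any model, while the matched split sits at chance, which is the condition the abstract argues is needed for a trustworthy MIA benchmark.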

17. ❌ Coherent Without Grounding, Grounded Without Success: Observability and Epistemic Failure

Authors: Camilo Chacón Sartori Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28371v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

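Scoring rationale: The paper analyzes the dissociation between understanding and action in large language models (LLMs), which it calls the “Bidirectional Coherence Paradox”. LLMs are explicitly its object of study, so it is highly relevant to “Large Language Models” (10 points). It examines how LLMs' explanatory ability, behavioral success, and genuine understanding relate, which touches on epistemic evaluation and explainability, but it has no direct connection to the technical keywords “Mechanistic Interpretability” or “Explainable AI” (0 points). It does not cover the specific methods, training techniques, optimization techniques, or application domains (such as AI for science) represented by the other keywords, so they all score 0.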

!!! tip deepseek-chat TL;DR

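This paper studies the disconnect in large language models (LLMs) between coherent explanation and effective action. It introduces the “Bidirectional Coherence Paradox” and argues that evaluating artificial epistemic agents requires a tripartite framework: coherence, grounding, and a proper basing relation linking explanation to action.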

Abstract (Chinese translation)

当智能体能够阐明某事物为何有效时，我们通常将此视为其具备真实理解的证据。这一观点预设了有效行动与正确解释之间存在共变关系，且连贯的解释能可靠地同时指示两者。我认为，这一假设对于当代大语言模型并不成立。我提出所谓的“双向连贯性悖论”：在不同认知条件下，能力与事实依据不仅相互分离，甚至呈现倒置关系。在低可观测性领域中，大语言模型常能成功执行任务，却错误识别导致其成功的机制；而在高可观测性领域中，它们虽能生成准确追踪可观测因果结构的解释，却往往无法将这些诊断转化为有效的干预措施。在这两种情况下，解释的连贯性依然得以保持，从而掩盖了深层的分离现象。通过编译器优化与超参数调优等实验，我构建了“认知三角”模型，用以阐释先验知识、信号与领域知识在不同可观测性条件下的互动机制。研究结果表明，仅凭行为成功或解释准确性均不足以认定其具备理解能力。我认为，评估人工认知智能体需要三元框架——即连贯性、事实依据以及连接解释与行动的恰当依据关系。大语言模型中“知其然”与“知其所以然”的系统性分离，由此对传统认识论及当前人工智能评估实践所承袭的假设提出了挑战。

Abstract

When an agent can articulate why something works, we typically take this as evidence of genuine understanding. This presupposes that effective action and correct explanation covary, and that coherent explanation reliably signals both. I argue that this assumption fails for contemporary Large Language Models (LLMs). I introduce what I call the Bidirectional Coherence Paradox: competence and grounding not only dissociate but invert across epistemic conditions. In low-observability domains, LLMs often act successfully while misidentifying the mechanisms that produce their success. In high-observability domains, they frequently generate explanations that accurately track observable causal structure yet fail to translate those diagnoses into effective intervention. In both cases, explanatory coherence remains intact, obscuring the underlying dissociation. Drawing on experiments in compiler optimization and hyperparameter tuning, I develop the Epistemic Triangle, a model of how priors, signals, and domain knowledge interact under varying observability. The results suggest that neither behavioral success nor explanatory accuracy alone suffices for attributing understanding. I argue that evaluating artificial epistemic agents requires a tripartite framework – coherence, grounding, and a proper basing relation linking explanation to action. The systematic separation of knowing-that and knowing-how in LLMs thus challenges assumptions inherited from both epistemology and current AI evaluation practice.

Keywords: Large Language Models, LLMs, Epistemic Failure, Bidirectional Coherence Paradox, Observability, Understanding, Explanation, Action

18. ❌ ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

Authors: Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zhibo Chen Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28162v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

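Scoring rationale: The paper studies diffusion-based colorization of old photos. Its core contributions are a structure-color decoupling framework and a progressive DPO strategy. It explicitly uses “Direct Preference Optimization (DPO)” (keyword 8), so that keyword scores 10. It does not cover large language models (LLMs), MoE, SLMs, scaling laws, pre-training or post-training, alignment, PEFT, RAG, context extension, inference optimization, agents, quantization, hallucination mitigation, interpretability, world models, model merging, or in-context learning, nor AI-for-science applications such as bioinformatics, so all other keywords score 0.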

!!! tip deepseek-chat TL;DR

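This paper proposes an old photo colorization framework built on the FLUX diffusion model. Using structure-color decoupling and a progressive DPO strategy, it produces more accurate and vivid colorization than existing methods.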

Abstract (Chinese translation)

老照片保存着珍贵的历史记忆，使其修复与着色具有重要价值。现有修复模型虽能处理去噪、划痕消除等部分退化问题，却常难以实现精准着色。这一局限源于老照片特有的退化特征——如亮度衰减与色相改变——其分布与现代照片存在显著差异，导致着色过程中产生巨大的域间差距。本文提出一种基于生成扩散模型FLUX的新型老照片着色框架。我们引入结构-颜色解耦策略，将结构保持与色彩恢复分离，在维持结构一致性的同时实现老照片的精确着色。进一步通过渐进式直接偏好优化策略增强模型，使其能通过色彩增强的由粗到细过渡学习细微色彩偏好。此外，针对文本提示的局限性，我们引入视觉语义提示机制，直接从老照片中提取细粒度语义信息，以消除老照片固有的色彩偏差。在合成与真实数据集上的实验结果表明，本方法优于包括闭源商业模型在内的现有先进着色技术，能生成高质量且色彩生动的着色结果。

Abstract

Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

Keywords: old photo colorization, diffusion model, structure-color decoupling, Direct Preference Optimization, visual semantic prompts, domain gap, FLUX, progressive DPO
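For readers unfamiliar with the objective that Pro-DPO builds on, the sketch below gives the standard DPO loss for a single preference pair, written over log-probabilities. It is the generic formulation, not the paper's method: the abstract does not describe how the loss is adapted to the FLUX diffusion model or how the coarse-to-fine pairs are scheduled, and the argument names and `beta` value here are assumptions.

```python
import math


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    margin = beta * [(log pi(win) - log pi_ref(win))
                     - (log pi(lose) - log pi_ref(lose))]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))


# The policy favours the preferred sample more than the reference does.
aligned = dpo_loss(-1.0, -3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)

# The policy favours the rejected sample instead, so the loss is higher.
misaligned = dpo_loss(-3.0, -1.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and lowers the rejected one's, and `beta` controls how strongly deviations from the reference are rewarded or penalized.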

19. ❌ Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation

Authors: Damian Sójka, Sebastian Cygert, Marc Masana Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28678v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

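Scoring rationale: PACE focuses on Continual Test-Time Adaptation, a subfield of Domain Adaptation, so it is highly relevant to the keyword “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation” (10 points). Its core is a backpropagation-free optimization method (CMA-ES with the Fastfood projection) aimed at runtime efficiency. It does not cover large language models (LLMs), Mixture of Experts (MoE), small language models (SLMs), scaling laws, post-training techniques (such as SFT, instruction tuning, RLHF, or PEFT), retrieval-augmented generation, context window extension, attention optimization, reasoning methods (such as CoT, System 2 thinking, or MCTS), self-improvement, agents, tool use, multi-agent systems, model compression, speculative decoding, hallucination mitigation, explainable AI, world models, model merging, in-context learning, or AI for science (such as bioinformatics). All keywords other than domain adaptation are therefore rated unrelated (0 points).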

!!! tip deepseek-chat TL;DR

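This paper proposes PACE, a backpropagation-free continual test-time adaptation system. By optimizing the affine parameters of normalization layers with an evolution strategy in a low-dimensional subspace, it achieves state-of-the-art accuracy under continual distribution shift while cutting runtime by more than 50% relative to existing backpropagation-free methods.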

Abstract (Chinese translation)

我们提出PACE，一种无需反向传播的持续测试时自适应系统，该系统直接优化归一化层的仿射参数。现有的无导数方法难以在运行时效率与学习能力之间取得平衡：它们要么将更新限制在输入提示中，要么无论领域稳定性如何都需要持续进行资源密集型的自适应。为克服这些局限性，PACE利用协方差矩阵自适应进化策略（Covariance Matrix Adaptation Evolution Strategy）结合Fastfood投影，在低维子空间内优化高维仿射参数，从而实现了卓越的自适应性能。此外，我们通过引入自适应停止准则和领域专用向量库来消除冗余计算，从而提升了运行时效率。我们的框架在持续分布偏移下的多个基准测试中达到了最先进的准确率，与现有的无反向传播方法相比，运行时间减少了50%以上。

Abstract

We introduce PACE, a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. Existing derivative-free approaches struggle to balance runtime efficiency with learning capacity, as they either restrict updates to input prompts or require continuous, resource-intensive adaptation regardless of domain stability. To address these limitations, PACE leverages the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional affine parameters within a low-dimensional subspace, leading to superior adaptive performance. Furthermore, we enhance the runtime efficiency by incorporating an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. Our framework achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts, reducing runtime by over 50% compared to existing backpropagation-free methods.

Keywords: Continual Test-Time Adaptation, Backpropagation-Free, Normalization Layers, Covariance Matrix Adaptation Evolution Strategy, Fastfood Projection, Runtime Efficiency, Domain Shift, Subspace Optimization
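The idea of optimizing high-dimensional parameters inside a low-dimensional subspace, using only loss evaluations, can be sketched briefly. The code below is a simplified stand-in and not PACE itself: it swaps the Fastfood projection for a dense Gaussian matrix and CMA-ES for a plain elitist evolution strategy, and the quadratic loss, dimensions, and all names are hypothetical.

```python
import random


def make_projection(dim_high: int, dim_low: int, rng: random.Random) -> list[list[float]]:
    """Dense Gaussian projection matrix P, used here in place of Fastfood."""
    scale = dim_low ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(dim_low)] for _ in range(dim_high)]


def lift(theta0, proj, z):
    """Map a low-dimensional vector z to full parameters: theta0 + P z."""
    return [t + sum(p * zi for p, zi in zip(row, z)) for t, row in zip(theta0, proj)]


def subspace_es(loss, theta0, proj, rng, steps=100, pop=16, elite=4, sigma=0.2):
    """Elitist evolution strategy over z (a simple stand-in for CMA-ES).
    Only loss evaluations are used, so no gradients are needed."""
    mean = [0.0] * len(proj[0])
    best = loss(lift(theta0, proj, mean))
    for _ in range(steps):
        samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean] for _ in range(pop)]
        samples.sort(key=lambda z: loss(lift(theta0, proj, z)))
        candidate = [sum(col) / elite for col in zip(*samples[:elite])]
        cand_loss = loss(lift(theta0, proj, candidate))
        if cand_loss < best:  # move the mean only when the loss improves
            mean, best = candidate, cand_loss
    return mean, best


rng = random.Random(0)
dim_high, dim_low = 64, 4
theta0 = [1.0] * dim_high    # e.g. initial normalization scales
target = [1.2] * dim_high    # hypothetical optimum for the toy loss


def toy_loss(theta):
    """Hypothetical objective: squared distance to `target`."""
    return sum((a - b) ** 2 for a, b in zip(theta, target))


proj = make_projection(dim_high, dim_low, rng)
initial_loss = toy_loss(theta0)
z_best, final_loss = subspace_es(toy_loss, theta0, proj, rng)
```

The search runs over 4 numbers rather than 64, which is the cost saving the subspace trick aims for. A real test-time objective would replace `toy_loss` with an unsupervised signal computed on incoming batches.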

20. ❌ Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL

Authors: Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28053v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

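Scoring rationale: The paper studies how to reduce human annotation feedback in preference-based reinforcement learning (preference-based RL). It proposes the ROVED framework, which combines a vision-language embedding (VLE) model with selective human feedback. Most keywords do not apply, because the paper sits at the intersection of computer vision, robotics, and reinforcement learning rather than large language model (LLM) techniques. The only relevant keyword is “PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”: the paper explicitly proposes a parameter-efficient fine-tuning method to adapt the VLE model, which is a core part of its contribution, so this keyword scores 10. Other keywords such as LLMs, MoE, and Scaling Laws are not covered and score 0.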

!!! tip deepseek-chat TL;DR

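This paper proposes ROVED, a framework that combines a vision-language embedding model with selective oracle feedback to cut annotation cost in preference-based reinforcement learning. On robotic manipulation tasks it reduces oracle queries by up to 80% and yields cumulative annotation savings of up to 90%.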

Abstract (Chinese translation)

基于偏好的强化学习能够通过比较学习有效的奖励函数，但其可扩展性受限于专家反馈的高成本。轻量级视觉语言嵌入模型提供了更经济的替代方案，但其噪声输出限制了其作为独立奖励生成器的有效性。为解决这一挑战，我们提出ROVED——一种将基于VLE的监督与定向专家反馈相结合的混合框架。该方法利用VLE生成片段级偏好，仅通过过滤机制识别高不确定性样本时才会调用专家标注。此外，我们提出一种参数高效的微调方法，利用获得的专家反馈对VLE进行自适应优化，从而以协同方式持续提升模型性能。这既保留了嵌入模型的扩展性，又确保了专家标注的准确性，同时避免了二者的效率缺陷。在多项机器人操作任务中，ROVED达到或超越了现有基于偏好的方法，同时将专家查询量减少达80%。值得注意的是，经过自适应优化的VLE能够跨任务泛化，累计标注成本降低达90%，这凸显了将可扩展嵌入模型与精准专家监督相结合应用于偏好强化学习的实用性。

Abstract

Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models provide a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework that combines VLE-based supervision with targeted oracle feedback. Our method uses the VLE to generate segment-level preferences and defers to an oracle only for samples with high uncertainty, identified through a filtering mechanism. In addition, we introduce a parameter-efficient fine-tuning method that adapts the VLE with the obtained oracle feedback in order to improve the model over time in a synergistic fashion. This ensures the retention of the scalability of embeddings and the accuracy of oracles, while avoiding their inefficiencies. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior preference-based methods while reducing oracle queries by up to 80%. Remarkably, the adapted VLE generalizes across tasks, yielding cumulative annotation savings of up to 90%, highlighting the practicality of combining scalable embeddings with precise oracle supervision for preference-based RL.

Keywords: Preference-based Reinforcement Learning, Vision-Language Embeddings, Oracle Feedback Reduction, Parameter-efficient Fine-tuning, Robotic Manipulation, ROVED Framework, Uncertainty Filtering, Hybrid Supervision
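The abstract says ROVED lets the VLE label segment pairs and defers to the oracle only for high-uncertainty samples. One simple way to express that routing is sketched below. It is an assumption-laden illustration: the Bradley-Terry style probability, the `margin` threshold, and all names are hypothetical, and the paper's actual filtering mechanism is not described in the abstract.

```python
import math
from typing import Callable


def preference_prob(score_a: float, score_b: float, temperature: float = 1.0) -> float:
    """Bradley-Terry style probability that segment A is preferred over B,
    computed from two VLE-derived scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))


def label_pair(score_a: float, score_b: float,
               oracle: Callable[[], int], margin: float = 0.2) -> tuple[int, str]:
    """Return (label, source), where label 1 means A is preferred.
    The oracle is queried only when the VLE probability is within `margin` of 0.5."""
    p = preference_prob(score_a, score_b)
    if abs(p - 0.5) < margin:
        return oracle(), "oracle"
    return (1 if p > 0.5 else 0), "vle"


# Confident pair: the VLE label is used and the oracle is never called.
confident = label_pair(2.0, 0.0, oracle=lambda: 0)

# Near-tie: the decision is deferred to the (here simulated) oracle.
uncertain = label_pair(0.1, 0.0, oracle=lambda: 0)
```

Raising `margin` sends more pairs to the oracle and lowers label noise. Lowering it saves queries at the cost of trusting noisier VLE labels.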

21. ❌ Article and Comment Frames Shape the Quality of Online Comments

Authors: Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27889v1

Score: 8.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

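Scoring rationale: The paper studies how the framing of news articles affects the quality of online comments, which places it in computational social science. The last sentence of the abstract mentions a “proactive frame-aware LLM-based system” for mitigating unhealthy discourse; this is a direct application of large language models, so the paper is rated highly relevant to the “Large Language Models” keyword (8 points). Its core is framing theory and comment-quality analysis rather than innovation in LLM techniques or other AI technology, so every other keyword (MoE, Scaling Laws, RLHF, RAG, and so on) is unrelated to the content and scores 0.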
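
The arithmetic behind the score line can be reproduced with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) comes from the report's configuration section; treating each row's score as weight × relevance and treating "pass" as score ≥ threshold are assumptions inferred from the tables, and the function names are hypothetical.

```python
def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)

def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Passing threshold as stated in the report configuration."""
    return base + per_weight * total_weight

# This paper: one keyword rated 8.0, the other 26 rated 0.0, all with weight 1.0.
rows = [(1.0, 8.0)] + [(1.0, 0.0)] * 26
score = total_score(rows)
threshold = passing_score(sum(weight for weight, _ in rows))
passed = score >= threshold
```

With 27 keywords of weight 1.0 the threshold evaluates to 26.6, so a single keyword rated 8.0 leaves the paper well short of passing.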

!!! tip deepseek-chat TL;DR

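This study examines whether and how the framing of news articles affects the quality of online comments. It finds that article frames significantly predict comment health, and it illustrates a frame-aware LLM-based system for mitigating unhealthy discourse.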

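Abstract translation

Framing theory holds that the way information is presented shapes audience responses, yet computational research has long overlooked audience feedback. Recent work has shown that article framing systematically shapes the content of reader replies; this paper goes on to ask whether framing also affects reply quality. Analyzing 1 million comments on 2,700 news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that, after controlling for topic, article frames significantly predict comment health, and that comments adopting the article's frame are healthier than those departing from it. In addition, unhealthy top-level comments tend to draw more unhealthy replies, an effect that is independent of the frame used in the comment. Our study establishes a link between framing theory and discourse quality and lays the groundwork for downstream applications. We illustrate this potential with a proactive, frame-aware LLM-based system for mitigating unhealthy discourse.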

Abstract

Framing theory posits that how information is presented shapes audience responses, but computational work has largely ignored audience reactions. While recent work showed that article framing systematically shapes the content of reader responses, this paper asks: Does framing also affect response quality? Analyzing 1M comments across 2.7K news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that article frames significantly predict comment health while controlling for topic, and that comments that adopt the article frame are healthier than those that depart from it. Further, unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used in the comment. Our results establish a link between framing theory and discourse quality, laying the groundwork for downstream applications. We illustrate this potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.

Keywords: framing theory, online comments, comment health, discourse quality, LLM-based system, news articles, computational social science

22. ❌ Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree

Authors: Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang, Zhuosheng Zhang, Shilin Wang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28508v1

Score: 8.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

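Scoring rationale: The paper proposes a new framework for detecting AI-generated images whose core is the integration of multimodal large language models (MLLMs) with lightweight detectors. It is therefore highly relevant to “Large Language Models” (MLLMs fall within that category) and scores 8. It does not involve the specific techniques behind the other keywords (MoE, SLMs, Scaling Laws, training methods, inference techniques, agent systems, model compression, and so on), nor a science-domain application such as bioinformatics, so those keywords score 0.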

!!! tip deepseek-chat TL;DR

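This paper proposes a framework that uses a fuzzy decision tree to integrate multimodal large language models with lightweight detectors, addressing the poor generalization of AI-generated image detection; in experiments it achieves state-of-the-art accuracy and strong generalization.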

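Abstract translation

The malicious use and wide dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods typically exploit low-level artifacts left by common operations in the generation pipeline, but they often generalize poorly because they overfit to specific models. Recently, researchers have turned to multimodal large language models (MLLMs) for AIGC detection, drawing on their high-level semantic reasoning and broad generalization. Although promising, MLLMs lack fine-grained perceptual sensitivity to subtle generation artifacts, which makes them inadequate as standalone detectors. To address this, we propose a novel AI-generated image detection framework that integrates lightweight artifact-aware detectors with MLLMs through a fuzzy decision tree. The tree treats the outputs of the base detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from the semantic and perceptual perspectives. Extensive experiments show that the proposed method achieves state-of-the-art detection accuracy and strong generalization across different generative models.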

Abstract

The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.

Keywords: AI-generated image detection, Multimodal Large Language Models, MLLMs, fuzzy decision tree, generalization, artifact-aware detectors, semantic reasoning, perceptual sensitivity
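
To make "outputs treated as fuzzy membership values" concrete, here is a minimal sketch of how a fuzzy decision tree can fuse two detector outputs. The tree shape, the leaf probabilities, and the detector names (`artifact`, `mllm`) are hypothetical and hand-picked for illustration; the paper's actual tree and how it is learned are not described in the abstract.

```python
def fuzzy_tree_predict(node, memberships):
    """Evaluate a fuzzy decision tree.

    A leaf is a float: the probability that the image is AI-generated.
    An internal node is (detector_name, low_branch, high_branch). The
    detector's membership value mu in [0, 1] sends weight (1 - mu) down
    the low branch and mu down the high branch, so both branches contribute.
    """
    if isinstance(node, float):
        return node
    name, low, high = node
    mu = memberships[name]
    return ((1.0 - mu) * fuzzy_tree_predict(low, memberships)
            + mu * fuzzy_tree_predict(high, memberships))

# Hypothetical tree: split on the artifact detector first, then on the MLLM.
tree = ("artifact", ("mllm", 0.1, 0.6), ("mllm", 0.7, 0.95))
p_fake = fuzzy_tree_predict(tree, {"artifact": 0.8, "mllm": 0.5})
crisp = fuzzy_tree_predict(tree, {"artifact": 1.0, "mllm": 0.0})
```

When every membership is exactly 0 or 1 the evaluation reduces to an ordinary decision tree; intermediate values blend the leaves, which is one way complementary cues can be fused adaptively.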

23. ❌ Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention

Authors: Seunghun Oh, Unsang Park Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28114v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

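Scoring rationale: The paper studies the cross-attention mechanism in diffusion models (Stable Diffusion in particular) and proposes a training-free frequency modulation method (AFM) for controlling text-to-image generation. All of the keywords target large language models (LLMs) and related techniques (training, alignment, inference, applications, and so on), whereas this paper focuses on analyzing attention in diffusion models, which are generative models but not LLMs. Most keywords (LLMs, MoE, SFT, RLHF, RAG, Agents, and so on) are therefore unrelated to the paper and score 0. The only loosely related keyword is “Mechanistic Interpretability” OR “Explainable AI”: the paper analyzes the dynamics of cross-attention (spectral progression, token competition), which is a form of interpretability research on a model's internal mechanisms, but this is not its core focus (the core is a control method, not explanation), so it receives 5 points (some relevance).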

!!! tip deepseek-chat TL;DR

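This paper characterizes the multi-resolution dynamics of cross-attention in diffusion models and proposes Attention Frequency Modulation (AFM), a training-free method that edits attention logits in the Fourier domain to bias the spatial scale of token competition, producing substantial visual edits while largely preserving semantic alignment.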

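Abstract translation

Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics during denoising remain under-characterized, which limits the development of principled training-free control. We model diffusion cross-attention as a spatiotemporal signal on the latent grid: token-softmax weights are aggregated into token-agnostic concentration maps, and their radially binned Fourier power spectrum is tracked over denoising. Across prompts and random seeds, encoder cross-attention shows a consistent coarse-to-fine spectral progression, forming a stable time-frequency fingerprint of token competition. Building on this structure, we propose Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention. It edits token-wise pre-softmax cross-attention logits in the Fourier domain: before the token softmax, the low- and high-frequency bands are reweighted on a schedule aligned with denoising progress, optionally gated adaptively by token-allocation entropy. AFM provides a continuous handle for biasing the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes the attention spectrum and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy acts mainly as an adaptive gain on the same frequency-based edit rather than as an independent control axis.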

Abstract

Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.

Keywords: Diffusion Models, Cross-Attention, Attention Frequency Modulation, Training-Free Control, Spectral Modulation, Fourier Domain, Stable Diffusion, Token Competition
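
The core operation described in the abstract, reweighting low- and high-frequency bands of pre-softmax logits before the token softmax, can be sketched with NumPy. This is an illustration under assumptions, not the authors' code: the radial cutoff, the linear `gains_at` schedule and its direction, and the random stand-in logits are all hypothetical, and entropy gating is omitted.

```python
import numpy as np

def radial_lowpass_mask(h, w, cutoff):
    """Boolean mask of FFT bins whose normalized radial frequency is <= cutoff."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fy**2 + fx**2) <= cutoff

def gains_at(progress, strength=0.5):
    """Hypothetical linear schedule over denoising progress in [0, 1]."""
    return 1.0 + strength * (1.0 - progress), 1.0 + strength * progress

def modulate_logits(logits, low_gain, high_gain, cutoff=0.15):
    """Reweight spatial-frequency bands of per-token logit maps of shape (T, H, W)."""
    _, h, w = logits.shape
    gain = np.where(radial_lowpass_mask(h, w, cutoff), low_gain, high_gain)
    spectrum = np.fft.fft2(logits, axes=(-2, -1))
    # The gain is radially symmetric, so the inverse transform stays real.
    return np.fft.ifft2(spectrum * gain, axes=(-2, -1)).real

def token_softmax(logits):
    """Softmax over the token axis at every spatial location."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16, 16))        # stand-in for 4 tokens on a 16x16 grid
low_gain, high_gain = gains_at(progress=0.25)
attn = token_softmax(modulate_logits(logits, low_gain, high_gain))
```

Setting both gains to 1.0 leaves the logits unchanged, which is a convenient sanity check that the intervention is a pure reweighting.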

24. ❌ Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation

Authors: Zahid Ullah, Sieun Choi, Jihie Kim Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28110v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

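Scoring rationale: The paper focuses on medical image segmentation and proposes a deep learning network for cardiac ultrasound segmentation (CGQR-Net) whose core idea is to integrate contour-guided structural priors with multi-resolution feature representations. The keywords all relate directly to large-model (LLM) techniques, training methods, inference optimization, agent systems, and so on, whereas this paper addresses a conventional computer vision and medical image analysis task and involves no large-model technique, principle, or application. The only loosely related keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: the work applies AI to biomedicine (cardiac ultrasound), which falls under a broad reading of “AI for Science”, but that is not its core or its novelty, so it receives 5 points (some relevance). The other keywords are unrelated to the paper and score 0.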

!!! tip deepseek-chat TL;DR

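This paper targets imprecise boundaries in cardiac ultrasound segmentation caused by low contrast, noise, and domain shift. It proposes a Contour-Guided Query Refinement Network (CGQR-Net) that integrates contour-derived structural priors with multi-scale features, and it reports improved segmentation accuracy and boundary delineation, evaluated on CAMUS with cross-dataset validation on CardiacNet.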

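Abstract translation

Accurate cardiac ultrasound segmentation is essential for reliably assessing ventricular function in intelligent healthcare systems. Echocardiographic images are hard to segment, however, because of low contrast, speckle noise, irregular boundaries, and domain shift across devices and patient populations. Existing methods rely mainly on appearance-driven learning and often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework combines multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial detail while capturing multi-scale context. The network first produces a coarse segmentation, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with the fused feature maps through cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to strengthen structural consistency. The method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results show improved segmentation accuracy, better boundary precision, and robust performance across imaging conditions. These results highlight the effectiveness of combining contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.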

Abstract

Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.

Keywords: cardiac ultrasound segmentation, boundary-aware, contour-guided, query refinement, multi-resolution features, cross-dataset generalization, HRNet, structural consistency
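
The pipeline in the abstract (coarse mask, then contour extraction, then contour queries attending to features) can be illustrated with a small NumPy sketch. It is a stand-in rather than CGQR-Net itself: queries are simply the features sampled at contour pixels instead of learnable embeddings, the feature map is random, and the function names are hypothetical.

```python
import numpy as np

def contour_pixels(mask):
    """Coordinates of foreground pixels with at least one 4-neighbour in the background."""
    padded = np.pad(mask, 1)  # pads with False for a boolean mask
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)

def cross_attention(queries, features):
    """Single-head scaled dot-product attention: each query attends to all features."""
    scores = queries @ features.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ features

mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1:5] = True                         # coarse segmentation: a 4x4 square
points = contour_pixels(mask)                 # pixels on the square's boundary
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 6, 8))            # stand-in for the fused feature map
queries = feats[points[:, 0], points[:, 1]]   # one query per contour pixel
refined = cross_attention(queries, feats.reshape(-1, 8))
```

In a trained network the refined query outputs would feed the segmentation and boundary heads; here they only show the shapes involved.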

25. ❌ Constructing Composite Features for Interpretable Music-Tagging

Authors: Chenhao Xue, Weitao Hu, Joyraj Chakraborty, Zhijin Guo, Kang Li, Tianyu Shi, Martin Reed, Nikolaos Thomos Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28644v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

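Scoring rationale: The paper studies feature fusion for music tagging and proposes an automatic feature-combination method based on Genetic Programming (GP) that improves performance while preserving interpretability. Its core is the application of a classical machine learning and feature engineering method (GP) to audio, not large models or deep learning. The keywords all revolve around large models, deep learning, and related techniques (training methods, inference optimization, alignment, agents, and so on) and have no direct connection to the paper. The only loosely related keyword is “Mechanistic Interpretability” OR “Explainable AI”: the paper stresses the interpretability advantage of its method over black-box deep models, but it does not study explainable-AI techniques as such and treats interpretability as a property of the method, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.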

!!! tip deepseek-chat TL;DR

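This paper addresses the lack of interpretability in deep feature fusion for music tagging by proposing a Genetic Programming method that automatically composes features, improving tagging performance while preserving interpretability.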

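Abstract translation

Combining multiple audio features can improve music tagging performance, but common deep learning-based feature fusion methods often lack interpretability. To address this, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by combining base music features mathematically, capturing synergistic interactions between features while preserving interpretability. The approach delivers representational benefits similar to deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets show consistent improvements over state-of-the-art systems across base feature sets at different levels of abstraction. Notably, most of the performance gain appears within the first few hundred GP evaluations, which indicates that effective feature combinations can be found under a modest search budget. The top evolved expressions include linear, nonlinear, and conditional forms, and several low-complexity solutions reach top performance, consistent with parsimony pressure that favors simpler expressions. Analysis of these composite features further reveals which interactions and transformations tend to benefit tagging, offering insight that is hard to obtain from black-box deep models.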

Abstract

Combining multiple audio features can improve the performance of music tagging, but common deep learning-based feature fusion methods often lack interpretability. To address this problem, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by mathematically combining base music features, thereby capturing synergistic interactions while preserving interpretability. This approach provides representational benefits similar to deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets demonstrate consistent improvements compared to state-of-the-art systems across base feature sets at different abstraction levels. It should be noted that most of the performance gains are noticed within the first few hundred GP evaluations, indicating that effective feature combinations can be identified under modest search budgets. The top evolved expressions include linear, nonlinear, and conditional forms, with various low-complexity solutions at top performance aligned with parsimony pressure to prefer simpler expressions. Analyzing these composite features further reveals which interactions and transformations tend to be beneficial for tagging, offering insights that remain opaque in black-box deep models.

Keywords: Music Tagging, Feature Fusion, Genetic Programming, Interpretability, Composite Features, Audio Features, MTG-Jamendo, GTZAN
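
A toy version of evolving composite features with parsimony pressure is sketched below. It is a deliberately small mutation-only evolutionary loop on synthetic data, not the paper's GP pipeline: the operator set, the mean-squared-error fitness, the `parsimony` weight, and the population sizes are assumptions, and crossover is omitted.

```python
import operator
import random

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def random_expr(n_features, depth, rng):
    """Random expression tree; an int leaf is a base-feature index."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    op = rng.choice(sorted(OPS))
    return (op, random_expr(n_features, depth - 1, rng),
            random_expr(n_features, depth - 1, rng))

def evaluate(expr, x):
    if isinstance(expr, int):
        return x[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(expr):
    return 1 if isinstance(expr, int) else 1 + size(expr[1]) + size(expr[2])

def fitness(expr, X, y, parsimony=0.01):
    """Negative mean squared error minus a penalty on expression size."""
    mse = sum((evaluate(expr, x) - t) ** 2 for x, t in zip(X, y)) / len(X)
    return -mse - parsimony * size(expr)

def mutate(expr, n_features, rng):
    """Replace a randomly chosen subtree with a fresh random one."""
    if isinstance(expr, int) or rng.random() < 0.3:
        return random_expr(n_features, 2, rng)
    op, left, right = expr
    if rng.random() < 0.5:
        return (op, mutate(left, n_features, rng), right)
    return (op, left, mutate(right, n_features, rng))

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
y = [x[0] * x[1] + x[2] for x in X]           # synthetic target with an interaction
population = [random_expr(3, 3, rng) for _ in range(100)]
initial_best = max(fitness(e, X, y) for e in population)
for _ in range(20):
    population.sort(key=lambda e: fitness(e, X, y), reverse=True)
    parents = population[:20]                 # truncation selection with elitism
    population = parents + [mutate(rng.choice(parents), 3, rng) for _ in range(80)]
best = max(population, key=lambda e: fitness(e, X, y))
```

Because `best` is a plain nested tuple such as `("add", ("mul", 0, 1), 2)`, it can be read directly as a formula over base features, which is the interpretability argument the abstract makes.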

26. ❌ Neural Federated Learning for Livestock Growth Prediction

Authors: Shoujin Wang, Mingze Ni, Wei Liu, Victor W. Chu, Kenny Sabir, Bryan Zheng, Ayush Kanwal, Roy Jing Yang, Fang Chen Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28117v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

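Scoring rationale: The paper, “Neural Federated Learning for Livestock Growth Prediction”, focuses on livestock growth prediction and proposes a federated learning framework built on a Gated Recurrent Unit and a multilayer perceptron (LivestockFL), together with a personalized variant (LivestockPFL). Its core is the application of federated learning to agriculture, aimed at data privacy and data sparsity. The keywords all relate directly to large-model (LLM) techniques, training methods, inference optimization, agent systems, and so on, whereas this paper neither contributes to the principles of large models or deep learning nor applies large models to a research domain. The only loosely related keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: livestock production belongs to agricultural science, so applying AI there can be viewed as a scientific application, but the paper does not mention bioinformatics or cheminformatics and its novelty lies in the federated learning framework rather than in AI techniques themselves, so it receives 5 points (some relevance). The other keywords are unrelated to the paper and score 0.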

!!! tip deepseek-chat TL;DR

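This paper addresses data privacy and data sparsity in livestock growth prediction. It proposes LivestockFL, described as the first federated learning framework designed specifically for this task, and a personalized variant, LivestockPFL, which train models collaboratively across distributed farms; experiments on a real-world dataset demonstrate effectiveness and practicality.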

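Abstract translation

Livestock growth prediction is essential for optimizing farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored because large-scale datasets are scarce and farm-level data raise privacy concerns. Existing biophysical models rely on fixed formulations, while most machine learning methods are trained on small, isolated datasets, which limits their robustness and generalization. To address these challenges, we propose LivestockFL, the first federated learning framework designed specifically for livestock growth prediction. It supports collaborative model training across distributed farms without sharing raw data, preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework uses a neural architecture that combines a Gated Recurrent Unit with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further propose LivestockPFL, a novel personalized federated learning framework that extends the framework above with a personalized prediction head trained on each farm's local data, yielding farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.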

Abstract

Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends the above federated learning framework with a personalized prediction head trained on each farm’s local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.

Keywords: Livestock growth prediction, Federated learning, Personalized federated learning, Data privacy, Gated Recurrent Unit, Multilayer perceptron, Farm management, Real-world dataset
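
The split between a shared model and a per-farm prediction head can be illustrated with federated averaging on a toy linear model. This is a sketch under stated assumptions, not LivestockFL or LivestockPFL: a single shared slope `w` stands in for the GRU and MLP backbone, a per-farm intercept stands in for the personalized head, and the farm data, learning rate, and round counts are made up.

```python
def local_update(w, b, data, lr=0.05, epochs=20):
    """Gradient descent on mean squared error for y ~ w * x + b using local data only."""
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def federated_round(w_global, heads, farms):
    """One round: farms train locally; only the shared weight is sent and averaged."""
    updates = []
    for name, data in farms.items():
        w_local, heads[name] = local_update(w_global, heads[name], data)
        updates.append((w_local, len(data)))
    total = sum(n for _, n in updates)
    return sum(w_local * n for w_local, n in updates) / total

# Two hypothetical farms that share a growth slope but differ in offset.
farms = {
    "farm_a": [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5)],
    "farm_b": [(x, 2.0 * x - 1.0) for x in (0.0, 1.0, 2.0)],
}
heads = {name: 0.0 for name in farms}   # personalized heads never leave the farm
w = 0.0
for _ in range(50):
    w = federated_round(w, heads, farms)
```

Raw records stay inside `local_update`; the server only sees each farm's shared weight and sample count, which mirrors the privacy argument in the abstract.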

27. ❌ From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers

Authors: Nihal Sanjay Singh, Mazdak Mohseni-Rajaee, Shaila Niazi, Kerem Y. Camsari Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27996v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

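Scoring rationale: The paper focuses on algorithmic innovation and hardware implementation for diffusion models: it proposes a new way of injecting correlated noise into the diffusion process and uses probabilistic computers for efficient sampling. None of the keywords relates to the paper except “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: the paper applies deep learning to scientific computing (for example, the Ising model), which falls under AI for Science, but this is not its core content, so it receives 5 points.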

!!! tip deepseek-chat TL;DR

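This paper proposes a generalized diffusion framework that replaces independent noise injection with MCMC dynamics incorporating known interaction structure and maps naturally onto probabilistic computers for efficient sampling, so that generative modeling can better exploit spatial correlations. On equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, correlated diffusion produces samples closer to MCMC reference distributions than standard independent diffusion.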

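Abstract translation

Diffusion models have become an important framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Existing implementations usually place most of the computation in the neural network, but diffusion as a framework allows a broader choice of stochastic transition kernel. Here we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when the couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes can exploit spatial correlations representative of the target system. The framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which offer orders-of-magnitude advantages over GPUs in sampling throughput and energy efficiency. We validate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, and the results show that correlated diffusion produces samples closer to MCMC reference distributions than independent diffusion. More broadly, the framework shows that probabilistic computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.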

Abstract

Diffusion models have emerged as a powerful framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Current implementations usually place most computation in the neural network, but diffusion as a framework allows a broader range of choices for the stochastic transition kernel. Here, we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes exploit spatial correlations representative of the target system. The resulting framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which provide orders-of-magnitude advantages in sampling throughput and energy efficiency over GPUs. We demonstrate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, showing that correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. More broadly, the framework shows that p-computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.

Keywords: Diffusion Models, Generative Modeling, Markov Chain Monte Carlo, Probabilistic Computers, Ising Model, Spin Glass, Correlated Diffusion, Sampling Efficiency
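
The claim that independent diffusion is recovered when the couplings are zero can be seen in a heat-bath (Gibbs) update for Ising spins. The sketch below is an illustration only: it shows one MCMC sweep on a small periodic 2D lattice, with lattice size, `coupling`, and `beta` chosen arbitrarily, and it does not reproduce the paper's noising schedule, denoiser, or p-bit hardware mapping.

```python
import math
import random

def gibbs_sweep(spins, coupling, beta, rng):
    """One heat-bath sweep over a periodic L x L lattice of +1/-1 spins."""
    size = len(spins)
    for i in range(size):
        for j in range(size):
            # Local field from the four nearest neighbours.
            field = coupling * (
                spins[(i - 1) % size][j] + spins[(i + 1) % size][j]
                + spins[i][(j - 1) % size] + spins[i][(j + 1) % size]
            )
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
            spins[i][j] = 1 if rng.random() < p_up else -1
    return spins

def magnetization(spins):
    return sum(map(sum, spins)) / len(spins) ** 2

rng = random.Random(0)
ordered = [[1] * 8 for _ in range(8)]
# With ferromagnetic coupling, the update respects neighbour correlations.
correlated = gibbs_sweep([row[:] for row in ordered], coupling=1.0, beta=1.0, rng=rng)
# With zero coupling, p_up is exactly 0.5: every spin is an independent coin flip.
independent = gibbs_sweep([row[:] for row in ordered], coupling=0.0, beta=1.0, rng=rng)
```

With `coupling=0.0` the field vanishes and the sweep replaces every spin with unbiased noise regardless of its neighbours, which is the independent-noise special case; a nonzero coupling makes the same update depend on the neighbourhood.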

28. ❌ Perspective of Fermi’s golden rule and its generalizations in chemical physics

Authors: Seogjoo J. Jang, Goun Kim, Young Min Rhee Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28373v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

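Scoring rationale: This paper is a perspective on Fermi’s golden rule (FGR) in chemical physics, covering its history, derivation, applications, and generalizations. The content belongs to traditional chemical physics and does not involve large models, deep learning, or any AI technique. The only possibly relevant keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: the paper sits in chemical physics and is loosely connected to computational chemistry, but it neither uses nor discusses AI methods, so it receives 5 points (some relevance). All other keywords concern large-model technology, training methods, inference optimization, or AI applications and are entirely unrelated, so each receives 0 points.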
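
As a rough check of the numbers above, the sketch below recomputes the passing score from the formula in the report's configuration section (5.0 + 0.8 × total weight) and the paper's total. The per-keyword rule `weight * relevance` is an assumption inferred from the table rows (1.0 × 5.0 = 5.0), and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Passing score from the report configuration: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight

def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)

# 26 keywords with relevance 0 and one keyword ("AI for Science" ...) with relevance 5.
rows = [(1.0, 0.0)] * 26 + [(1.0, 5.0)]
score = total_score(rows)
threshold = pass_threshold(total_weight=27.0)
passed = score >= threshold
```

Under these assumptions the paper totals 5.0 against a threshold of about 26.6, which matches the failing mark shown for this entry.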

!!! tip deepseek-chat TL;DR

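The paper reviews Fermi’s golden rule in chemical physics: its history, derivation, applications, open issues, and recent generalizations and computational methods.

Abstract (translated)

This perspective gives a brief history of Fermi’s golden rule (FGR) and outlines its derivation, underlying assumptions, and representative forms. It reviews the major applications of FGR in chemical physics and related fields, which illustrate the rule’s broad applicability and success. It also clarifies ambiguities and open issues that arise in practical applications, and discusses recent advances in generalizations of FGR and in computational methods for applying it.

Abstract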

This perspective provides a succinct history of Fermi’s golden rule (FGR), overview of its derivation, assumptions, and representative forms. Major applications of FGR, mostly in the field of chemical physics, are reviewed. These illustrate the broad applicability and success of FGR. Ambiguities and open issues encountered in practical applications of FGR are clarified. Recent advances in generalizations of FGR and computational methods for practical applications are addressed.

Keywords: Fermi’s golden rule, chemical physics, rate theory, quantum dynamics, computational methods, generalizations, applications, perspective
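
For orientation, the textbook first-order form of Fermi’s golden rule is given below. The abstract does not state which representative forms the paper covers, so the notation here is an assumption: $\Gamma_{i \to f}$ is the transition rate from initial state $|i\rangle$ to final states $|f\rangle$, $\hat{H}'$ is the perturbation, and $\rho(E_f)$ is the density of final states.

```latex
\Gamma_{i \to f} = \frac{2\pi}{\hbar}\,
  \bigl|\langle f | \hat{H}' | i \rangle\bigr|^{2}\,\rho(E_f)
```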

29. ❌ Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Authors: N Alex Cayco Gajic, Arthur Pellegrino Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28764v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

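Scoring rationale: The paper proposes metric similarity analysis (MSA), a method based on Riemannian geometry for comparing the intrinsic geometry of neural representations, which falls under neural-network interpretability research. The work is mainly a theoretical framework for analyzing representational geometry, not large-model technology or a specific application. Among all keywords, only “Mechanistic Interpretability OR Explainable AI” has some relevance, because the paper aims to understand the mechanisms of neural computation through geometric analysis, which belongs to explainable AI. The other keywords concern specific large-model techniques, training methods, application areas, or particular problems and have no direct connection to this theoretical geometric-analysis paper.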

!!! tip deepseek-chat TL;DR

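The paper proposes metric similarity analysis, a Riemannian-geometry-based method for comparing the intrinsic geometry of neural representations, to better understand the mechanisms behind different neural network solutions.

Abstract (translated)

Similarity measures are widely used to interpret the representational geometries that neural networks use to solve tasks. However, because existing methods compare the extrinsic geometry of representations in state space, not their intrinsic geometry, they may miss subtle but crucial differences between fundamentally different neural network solutions. This paper introduces metric similarity analysis (MSA), a new method that uses tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to (i) disentangle features of neural computations in deep networks under different learning regimes, (ii) compare nonlinear dynamics, and (iii) investigate diffusion models. We thereby provide a mathematically grounded and broadly applicable framework for understanding the mechanisms behind neural computations by comparing their intrinsic geometries.

Abstract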

Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.

Keywords: similarity measures, neural representations, Riemannian geometry, intrinsic geometry, manifold hypothesis, metric similarity analysis, neural computations, diffusion models
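
The distinction between extrinsic and intrinsic geometry in the abstract can be made concrete with the pullback metric $g = J^\top J$ of an embedding. The sketch below is only an illustration of that distinction, not the MSA method itself; `pullback_metric`, `flat_sheet`, and `rolled_sheet` are hypothetical names. A flat sheet and the same sheet rolled into a cylinder occupy different shapes in 3D, yet both induce the identity metric.

```python
import math

def pullback_metric(embed, u, v, h=1e-5):
    """Finite-difference estimate of the 2x2 metric g = J^T J induced by embed(u, v)."""
    p_u_plus, p_u_minus = embed(u + h, v), embed(u - h, v)
    p_v_plus, p_v_minus = embed(u, v + h), embed(u, v - h)
    du = [(a - b) / (2 * h) for a, b in zip(p_u_plus, p_u_minus)]
    dv = [(a - b) / (2 * h) for a, b in zip(p_v_plus, p_v_minus)]
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    return [[dot(du, du), dot(du, dv)], [dot(dv, du), dot(dv, dv)]]

def flat_sheet(u, v):
    """Plane embedded in 3D."""
    return (u, v, 0.0)

def rolled_sheet(u, v):
    """The same sheet rolled into a unit cylinder."""
    return (math.cos(u), math.sin(u), v)

g_flat = pullback_metric(flat_sheet, 0.3, 0.7)
g_rolled = pullback_metric(rolled_sheet, 0.3, 0.7)
```

A comparison based only on positions in state space would separate the two surfaces, while a comparison of their metrics would treat them as the same.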

30. ❌ ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Authors: Anuj Diwan, Eunsol Choi, David Harwath Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

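Scoring rationale: ParaSpeechCLAP focuses on a dual-encoder contrastive speech-text model for speech style description and is unrelated to most keywords. It is relevant only to “Pre-training OR Continual Pre-training OR Domain Adaptation” (8 points), because the paper involves pretraining (language-audio pretraining) and domain adaptation (handling many kinds of style descriptions). The other keywords, covering LLM techniques, reasoning, alignment, compression, AI for science, and so on, do not appear in the paper.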

!!! tip deepseek-chat TL;DR

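The paper proposes ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style descriptions into a common embedding space and outperforms baselines on most metrics in style retrieval, attribute classification, and TTS reward modeling.

Abstract (translated)

We present ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style descriptions into a common embedding space. It supports a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors, such as pitch, texture, and emotion, going far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models together with a unified ParaSpeechCLAP-Combined model, and find that the specialized models perform better on individual style dimensions while the unified model does better on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We validate the models in three applications: style caption retrieval, speech attribute classification, and use as an inference-time reward model that improves style-prompted text-to-speech without additional training. ParaSpeechCLAP outperforms baselines on most metrics in all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap.

Abstract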

We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models’ performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .

Keywords: speech-text model, dual-encoder, contrastive learning, style captioning, language-audio pretraining, text-to-speech, embedding space, speech attribute classification
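
Once speech and captions share an embedding space, style caption retrieval reduces to ranking captions by similarity to a speech embedding. The sketch below assumes cosine similarity and uses toy 3-dimensional vectors in place of real encoder outputs; it does not use the released ParaSpeechCLAP code, and `rank_captions` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_captions(speech_embedding, caption_embeddings):
    """Return caption indices sorted from most to least similar to the speech embedding."""
    scores = [cosine(speech_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings standing in for the outputs of the two encoders.
speech = [0.9, 0.1, 0.0]
captions = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_captions(speech, captions)
```

The same similarity score could in principle serve as an inference-time reward for candidate TTS outputs, which is the third application the abstract lists.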

31. ❌ On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Authors: Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

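Scoring rationale: The paper studies diversity in Diffusion Transformers and proposes applying repulsion in the contextual space to increase generative diversity. All scoring keywords target large language models (LLMs) and related techniques, whereas this paper focuses on text-to-image (T2I) diffusion models, specifically the Diffusion Transformer architecture, and does not touch on LLMs, MoE, alignment, reasoning, agents, quantization, or other LLM-related techniques. Keywords such as “AI for Science” do cover scientific applications, but this paper belongs to computer vision and generative modeling, not to a specific scientific field such as bioinformatics, so every keyword has a relevance of 0.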

!!! tip deepseek-chat TL;DR

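To address the lack of diversity in text-to-image diffusion models, the paper applies on-the-fly repulsion in the contextual space of Diffusion Transformers, substantially increasing generative diversity without sacrificing visual fidelity or semantic adherence.

Abstract (translated)

Modern text-to-image (T2I) diffusion models achieve remarkable semantic alignment, but their outputs often lack variety: for any given prompt, the model tends to converge on a narrow set of visual solutions. This typicality bias is a problem for creative applications that need a wide range of outputs. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path, whereas acting on spatially committed intermediate latents tends to disrupt the forming visual structure and produce artifacts. We propose applying repulsion in the Contextual Space as a new framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This lets us redirect the guidance trajectory after it has become structurally informed but before the composition is fixed. Experiments show that repulsion in the Contextual Space yields significantly richer diversity without sacrificing visual fidelity or semantic adherence. The method is also efficient: it adds only a small computational overhead and remains effective in modern “Turbo” and distilled models, where traditional trajectory-based interventions typically fail.

Abstract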

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern “Turbo” and distilled models where traditional trajectory-based interventions typically fail.

Keywords: Text-to-Image Diffusion Models, Diversity Generation, Diffusion Transformers, Contextual Space Repulsion, Multimodal Attention, Generative Diversity, Visual Fidelity, Semantic Adherence

32. ❌ RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

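Scoring rationale: The paper studies an architecture documentation framework for AI-augmented ecosystems (RAD-AI), focusing on software engineering, documentation standards, and regulatory compliance (EU AI Act), not on the technical principles of large models or deep learning, nor on specific application innovations. All keywords concern concrete technical topics such as large-model technology, training methods, inference optimization, and application areas, and are entirely unrelated to the paper’s software architecture documentation theme.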

!!! tip deepseek-chat TL;DR

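Because existing software architecture documentation frameworks cannot meet the needs of AI systems, the paper proposes the RAD-AI extension framework, which raises the addressability of EU AI Act Annex IV documentation requirements from about 36% to 93% (mean practitioner rating).

Abstract (translated)

AI-augmented ecosystems (integrated systems in which multiple AI components are interconnected through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. However, the architecture documentation frameworks that practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap has regulatory consequences: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV, for which no existing framework provides structured support, and enforcement for high-risk systems begins on August 2, 2026. This paper presents RAD-AI, a backward-compatible extension that adds eight AI-specific sections to arc42 and three diagram extensions to the C4 model, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI raises Annex IV addressability from approximately 36% to 93% (mean rating), a substantial improvement over existing frameworks. A comparative analysis of two production AI platforms (Uber Michelangelo and Netflix Metaflow) captures eight AI-specific concerns missed by standard frameworks and shows that the documentation deficiencies are structural and not domain-specific. A smart mobility ecosystem case study further reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.

Abstract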

AI-augmented ecosystems (interconnected systems where multiple AI components interact through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. Yet the architecture documentation frameworks practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap carries regulatory consequence: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV that no existing framework provides structured support for, with enforcement for high-risk systems beginning August 2, 2026. We present RAD-AI, a backward-compatible extension framework that augments arc42 with eight AI-specific sections and C4 with three diagram extensions, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI increases Annex IV addressability from approximately 36% to 93% (mean rating) and demonstrates substantial improvement over existing frameworks. Comparative analysis on two production AI platforms (Uber Michelangelo, Netflix Metaflow) captures eight additional AI-specific concerns missed by standard frameworks and demonstrates that documentation deficiencies are structural rather than domain-specific. An illustrative smart mobility ecosystem case study reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.

Keywords: AI-augmented ecosystems, architecture documentation, arc42, C4 model, EU AI Act, regulatory compliance, technical documentation, smart mobility ecosystem

33. ❌ Stepwise Credit Assignment for GRPO on Flow-Matching Models

Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

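Scoring rationale: The paper studies reinforcement-learning credit assignment in diffusion (flow-matching) models and proposes an improvement to Flow-GRPO. Although it involves deep learning techniques (diffusion models and reinforcement learning), all keywords explicitly target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, and RAG), and this paper does not involve language models or text generation at all. It focuses on diffusion models for image generation, so every keyword has a relevance of 0.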

!!! tip deepseek-chat TL;DR

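Flow-GRPO assigns credit uniformly across steps and so ignores the temporal structure of diffusion generation. The paper proposes Stepwise-Flow-GRPO, which assigns credit according to each step’s reward improvement and introduces a DDIM-inspired SDE, achieving better sample efficiency and faster convergence.

Abstract (translated)

Flow-GRPO successfully applies reinforcement learning to flow models, but it assigns credit uniformly across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). In addition, assigning uniform credit based only on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit according to each step’s reward improvement. By using Tweedie’s formula to obtain intermediate reward estimates and introducing gain-based advantages, the method achieves better sample efficiency and faster convergence. We also propose a DDIM-inspired stochastic differential equation (SDE) that improves reward quality while preserving the stochasticity needed for policy gradients.

Abstract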

Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step’s reward improvement. By leveraging Tweedie’s formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.

Keywords: Flow-GRPO, Stepwise Credit Assignment, Diffusion Models, Reinforcement Learning, Tweedie’s Formula, DDIM-inspired SDE, Policy Gradients, Sample Efficiency
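
The abstract gives no equations, so the following is only a sketch of the two ingredients it names, under stated assumptions. It assumes a rectified-flow convention, $x_t = (1 - t)\,x_0 + t\,\epsilon$ with velocity $v = \epsilon - x_0$, under which a Tweedie-style clean-sample estimate is $x_t - t\,v$. It also assumes that a step's gain is the difference between consecutive intermediate rewards. The paper's actual estimator and advantage may differ, and the function names are hypothetical.

```python
def estimate_clean_sample(x_t, velocity, t):
    """Recover x_0 from x_t = (1 - t) * x_0 + t * noise when velocity = noise - x_0."""
    return [x - t * v for x, v in zip(x_t, velocity)]

def stepwise_gains(step_rewards):
    """Reward improvement contributed by each step: r[k + 1] - r[k]."""
    return [b - a for a, b in zip(step_rewards, step_rewards[1:])]

# Check the clean-sample identity on a toy 2-vector.
x0, noise, t = [1.0, 2.0], [0.5, -1.0], 0.3
x_t = [(1 - t) * a + t * n for a, n in zip(x0, noise)]
velocity = [n - a for a, n in zip(x0, noise)]
x0_hat = estimate_clean_sample(x_t, velocity, t)

# Hypothetical rewards of the intermediate estimates along one trajectory.
rewards = [0.1, 0.4, 0.3, 0.9]
gains = stepwise_gains(rewards)
```

The gains telescope to the difference between the final and initial reward, so a step that lowers the intermediate reward receives negative credit even when the final image scores well.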

34. ❌ Dynamic Dual-Granularity Skill Bank for Agentic RL

Authors: Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dongbin Zhao Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

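Scoring rationale: The paper proposes D2Skill, which focuses on building and maintaining a skill bank for agentic reinforcement learning (agentic RL). Its core contributions are dynamic dual-granularity skill modeling (task skills and step skills) and a utility-based skill maintenance mechanism. The paper is highly relevant to “LLM Agents OR Autonomous Agents OR Agentic Workflow” (10 points), since it is core research on agentic workflows. It has some relevance to “Large Language Models OR LLMs OR Foundation Models” (8 points), since the experiments use Qwen-series large models as the base, and to “Self-Correction OR Self-Improvement OR Self-Reflection” (8 points), since the skill bank is expanded and maintained through a reflection mechanism. Other keywords such as MoE, SFT, RAG, and CoT are not covered in the paper and receive 0 points.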

!!! tip deepseek-chat TL;DR

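To address the underuse of reusable experience in agentic reinforcement learning, the paper proposes D2Skill, a dynamic dual-granularity skill bank that models task skills and step skills and maintains them with a utility-based mechanism, improving agent success rates on ALFWorld and WebShop by 10-20 points over skill-free baselines.

Abstract (translated)

Agentic reinforcement learning (RL) can benefit substantially from reusable experience, but existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills (high-level guidance) and step skills (fine-grained decision support and error correction). D2Skill jointly trains the policy and the skill bank by running paired baseline and skill-injected rollouts under the same policy, and uses their performance gap to derive hindsight utility signals that drive both skill updating and policy optimization. The skill bank is built entirely from training-time experience, is continuously expanded through reflection, and is maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, and that the learned skills have higher utility, transfer across evaluation settings, and add only modest training overhead.

Abstract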

Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.

Keywords: Agentic Reinforcement Learning, Skill Bank, Dual-Granularity Skills, Dynamic Skill Maintenance, Hindsight Utility, Qwen LLMs, ALFWorld, WebShop
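
The paired-rollout idea in the abstract can be sketched as follows. This is a minimal illustration, not D2Skill's implementation: it assumes a skill's utility is an exponential moving average of the reward gap between a skill-injected rollout and a baseline rollout, and that pruning drops skills below a threshold. The skill names, `update_utility`, and `prune` are all hypothetical.

```python
def update_utility(old_utility, reward_with_skill, reward_baseline, rate=0.5):
    """Move a skill's utility toward the observed gap between paired rollouts."""
    gap = reward_with_skill - reward_baseline
    return (1 - rate) * old_utility + rate * gap

def prune(skill_bank, min_utility=0.0):
    """Keep only skills whose utility is at least min_utility."""
    return {name: u for name, u in skill_bank.items() if u >= min_utility}

# Hypothetical skills, each updated from one (skill-injected, baseline) reward pair.
bank = {"check_inventory_first": 0.0, "click_random_link": 0.0}
bank["check_inventory_first"] = update_utility(bank["check_inventory_first"], 1.0, 0.0)
bank["click_random_link"] = update_utility(bank["click_random_link"], 0.0, 1.0)
bank = prune(bank)
```

Because both rollouts run under the same policy, the gap isolates the contribution of the injected skill, which is why it can serve as a hindsight signal.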

35. ❌ A Convex Route to Thermomechanics: Learning Internal Energy and Dissipation

Authors: Hagen Holthusen, Paul Steinmann, Ellen Kuhl Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

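Scoring rationale: The paper proposes a physics-based neural network framework for discovering constitutive models in fully coupled thermomechanics. It belongs to AI for Science, applied specifically to thermomechanics and materials science. Its core idea is to use input convex neural networks to guarantee thermodynamic consistency, not to innovate on large language models (LLMs) or on deep learning principles. The keyword “AI for Science OR Bioinformatics OR Cheminformatics” therefore receives 5 points (the paper applies AI in a scientific field, but not in bioinformatics or cheminformatics). All other keywords, which concern large language models, deep learning principles, model training, alignment, inference optimization, agent systems, and so on, are unrelated and receive 0 points.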

!!! tip deepseek-chat TL;DR

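The paper proposes a physics-based framework built on input convex neural networks that learns thermodynamically consistent constitutive models (internal energy and dissipation potential) as functions of deformation and entropy, and shows on synthetic and experimental datasets that it accurately captures the thermomechanical response of soft tissues and filled rubbers.

Abstract (translated)

We present a physics-based neural network framework for discovering constitutive models in fully coupled thermomechanics. Unlike classical formulations based on the Helmholtz free energy, we adopt the internal energy and a dissipation potential as the primary constitutive functions and express them in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity-concavity conditions and makes it easier to incorporate thermodynamic principles consistently. In this work, we focus on materials without preferred directions or internal variables.
Although the formulation is posed in terms of entropy, the temperature is treated as the independent observable and the entropy is inferred internally through the constitutive relation, which enables thermodynamically consistent modeling without entropy data.
Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, which ensures convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations.
We validate the framework on synthetic and experimental datasets, including purely thermal problems and the fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are publicly available via https://doi.org/10.5281/zenodo.19248596.

Abstract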

We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity–concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via https://doi.org/10.5281/zenodo.19248596.

Keywords: physics-based neural network, thermomechanics, constitutive models, internal energy, dissipation potential, input convex neural networks, thermodynamic consistency, soft tissues
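
As a rough illustration of how convexity can be guaranteed by construction, the sketch below builds a tiny input convex network in plain Python. It is a minimal sketch, not the authors' architecture: the class name `TinyICNN`, the layer sizes, the softplus activation, and the simple zero anchoring f(x) - f(0) are assumptions made for illustration, and the input vector merely stands in for the invariants and entropy mentioned in the abstract.

```python
import math
import random


def softplus(v):
    # Convex, non-decreasing activation, written in a numerically stable form.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


class TinyICNN:
    """Scalar-valued network that is convex in its input by construction (hypothetical)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # First layer: weights on the input may have any sign.
        self.W0 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b0 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        # Output layer: weights on hidden units are forced non-negative,
        # while the linear skip connection from the input is unconstrained.
        self.wz = [abs(rng.uniform(-1, 1)) for _ in range(n_hidden)]
        self.wx = [rng.uniform(-1, 1) for _ in range(n_in)]

    def raw(self, x):
        hidden = [softplus(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.W0, self.b0)]
        return (sum(w * h for w, h in zip(self.wz, hidden))
                + sum(w * xi for w, xi in zip(self.wx, x)))

    def __call__(self, x):
        # Zero anchoring: subtract the value at the origin so that f(0) = 0.
        return self.raw(x) - self.raw([0.0] * len(x))


net = TinyICNN(n_in=3, n_hidden=8, seed=1)
```

Convexity follows because a softplus of an affine map is convex, a non-negative combination of convex functions is convex, and adding a linear term preserves convexity. A midpoint check such as `net(mid) <= (net(a) + net(b)) / 2` on random points is a quick sanity test of that property.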

36. ❌ Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

Authors: Khalid Adnan Alsayed Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

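Rationale: The paper focuses on evaluating the fairness of facial recognition systems, particularly in law enforcement settings. It addresses algorithmic fairness in computer vision rather than innovations in large language models or deep learning techniques. The scoring keywords relate directly to large language models, whereas this paper studies facial recognition, which belongs to a different AI subfield (computer vision), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper finds that aggregate accuracy alone is insufficient for assessing the fairness of law enforcement facial recognition systems, because it masks critical performance differences between demographic groups; a more comprehensive fairness evaluation framework is needed.

Abstract translation (Chinese)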

面部识别系统正日益广泛地应用于执法与安防领域，其算法决策可能带来重大的社会影响。尽管已报道的准确率普遍较高，但越来越多的证据表明，此类系统在不同人口统计学群体间的表现往往不均衡，导致差异化的错误率及潜在危害。本文认为，在关键应用场景中，仅依靠总体准确率不足以评估面部识别系统的公平性与可靠性。通过对子群体层面的错误分布——包括误识率（False Positive Rate, FPR）与拒识率（False Negative Rate, FNR）——进行分析，本文揭示了总体性能指标如何掩盖不同人口群体间的关键差异。实证观察表明，总体准确率相近的系统可能呈现出显著不同的公平性特征，即在单一总体指标下，子群体的错误率仍存在巨大差异。本文进一步探讨了在执法应用中，以准确率为中心的评估方式所带来的操作风险，其中错误分类可能导致不当怀疑或漏检身份。文章强调，应采用关注公平性的评估方法及与模型无关的审计策略，以实现对实际部署系统的后置评估。研究结果强调，必须超越将准确率作为核心指标的做法，并采用更全面的评估框架，以推动负责任的人工智能部署。

Abstract

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.

Keywords: facial recognition systems, fairness evaluation, law enforcement, aggregate accuracy, subgroup error distribution, algorithmic fairness, model-agnostic auditing, responsible AI deployment
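
The abstract's central point, that one aggregate accuracy figure can hide very different subgroup error rates, can be sketched with a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the group labels "A" and "B", and all counts are made up, and nothing here comes from the paper's data.

```python
from collections import defaultdict


def error_rates(records):
    """records: iterable of (group, y_true, y_pred) with binary labels (1 = match)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        counts[group][key] += 1
        counts["ALL"][key] += 1
    report = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        report[group] = {
            "accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
        }
    return report


def make(group, tp, fp, tn, fn):
    # Expand confusion-matrix counts into individual (group, y_true, y_pred) records.
    return ([(group, 1, 1)] * tp + [(group, 0, 1)] * fp
            + [(group, 0, 0)] * tn + [(group, 1, 0)] * fn)


# Synthetic counts, chosen only so that both systems share the same overall accuracy.
system_1 = make("A", 45, 5, 45, 5) + make("B", 45, 5, 45, 5)
system_2 = make("A", 50, 0, 50, 0) + make("B", 40, 10, 40, 10)

report_1 = error_rates(system_1)
report_2 = error_rates(system_2)
```

Both hypothetical systems are 90% accurate overall, yet the second concentrates every error in group B. Reporting `fpr` and `fnr` per group alongside the `ALL` row is what makes that difference visible.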

37. ❌ AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Authors: Min Wang, Ata Mahjoubfar Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

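Rationale: AMIGO evaluates agentic vision-language models on long-horizon, multi-turn interactive tasks; its core concerns are agentic workflows and reasoning. It is highly relevant to "LLM Agents" (10), since the paper directly studies the behavior of agentic models in interactive tasks. It is relevant to "Large Language Models" (8), since agentic models are typically built on LLMs. It is relevant to "Chain of Thought" and "System 2 Thinking" (8 each), since the task involves multi-step reasoning, decision-making under uncertainty, and deliberate thinking. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the AMIGO benchmark for evaluating agentic vision-language models on long-horizon, multi-turn hidden-target identification, focusing on question selection, constraint tracking, and fine-grained discrimination, and reports metrics including identification success, protocol compliance, and noise tolerance.

Abstract translation (Chinese)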

智能体视觉语言模型日益通过多轮交互执行任务，但多数评估仍集中于单图像、单轮次的正确性判断。我们提出AMIGO（智能体多图像定位预言基准），这是一个面向视觉相似图像库中隐藏目标识别的长程基准。在AMIGO中，预言方会秘密选定一张目标图像，模型必须通过遵循严格协议的多轮属性聚焦型是/否/不确定提问序列来还原该目标，该协议会通过跳过指令惩罚无效操作。此设定着重考察：（一）不确定性下的问题选择能力，（二）跨轮次的一致性约束追踪能力，以及（三）证据累积过程中的细粒度判别能力。AMIGO还支持受控的预言方不完美反馈机制，以探究不一致反馈下的模型鲁棒性与验证行为。我们通过“猜中我心仪服饰”任务实例化AMIGO，并报告涵盖结果与交互质量的多元指标，包括识别成功率、证据验证度、交互效率、协议遵从性、噪声容忍度以及轨迹级诊断分析。

Abstract

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.

Keywords: Agentic vision-language models, Multi-image grounding, Long-horizon benchmark, Hidden-target identification, Question selection under uncertainty, Constraint tracking, Fine-grained discrimination, Oracle imperfections
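
The interaction loop described in the abstract (ask an attribute question, receive Yes/No/Unsure, narrow the candidate set) can be sketched as a simple elimination game. This is a toy sketch under stated assumptions, not the benchmark's protocol: the gallery, the attribute names, the balanced-split question heuristic, and the noise-free oracle are all hypothetical, and the Skip penalty and oracle imperfections are not modeled.

```python
def best_question(candidates, asked):
    """Pick the unasked attribute whose Yes/No split of the candidates is most balanced."""
    attrs = {a for item in candidates.values() for a in item} - asked

    def imbalance(attr):
        yes = sum(1 for item in candidates.values() if item.get(attr, False))
        return abs(2 * yes - len(candidates))

    # Sorting first makes tie-breaking deterministic.
    return min(sorted(attrs), key=imbalance) if attrs else None


def identify(gallery, oracle, max_turns=10):
    candidates, asked, turns = dict(gallery), set(), 0
    while len(candidates) > 1 and turns < max_turns:
        attr = best_question(candidates, asked)
        if attr is None:
            break
        asked.add(attr)
        turns += 1
        answer = oracle(attr)  # "yes", "no", or "unsure"
        if answer == "unsure":
            continue  # no constraint gained on this turn
        want = answer == "yes"
        # Constraint tracking: keep only candidates consistent with the answer.
        candidates = {k: v for k, v in candidates.items() if v.get(attr, False) == want}
    return sorted(candidates), turns


# Hypothetical four-image gallery with boolean attributes.
gallery = {
    "dress_1": {"red": True, "long_sleeve": True, "floral": False},
    "dress_2": {"red": True, "long_sleeve": False, "floral": False},
    "dress_3": {"red": False, "long_sleeve": True, "floral": True},
    "dress_4": {"red": False, "long_sleeve": False, "floral": True},
}
target = "dress_3"


def oracle(attr):
    return "yes" if gallery[target][attr] else "no"


result, turns = identify(gallery, oracle)
```

With a truthful oracle the balanced-split heuristic halves the gallery each turn. Replacing `oracle` with one that sometimes lies or answers "unsure" is a simple way to see why the benchmark also measures verification behavior and noise tolerance.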

38. ❌ Information-Theoretic Limits of Safety Verification for Self-Improving Systems

Authors: Arsenios Scrivens Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28650v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

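Rationale: The paper studies the theoretical limits of safety verification for self-improving systems. It is highly relevant to LLMs (the abstract mentions validation on GPT-2 and LLM-scale verification), directly relevant to LoRA (it gives the LoRA parameter details used in the validation), and highly relevant to Self-Improvement (it studies the safety boundary of self-improving systems). Other keywords such as MoE, SLMs, and Scaling Laws are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper asks whether a self-improving system can achieve unbounded beneficial self-modification while keeping cumulative risk bounded. It establishes theoretical limits of safety verification, proves a fundamental limitation of classifier-based approaches, and shows that a Lipschitz verifier (with a LoRA parameterization) can achieve zero risk with positive utility at LLM scale.

Abstract translation (Chinese)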

安全门能否在维持有限累积风险的同时，允许无限的有益自我修改？我们通过双重条件——要求∑δ_n < ∞（有限风险）与∑TPR_n = ∞（无限效用）——将这一问题形式化，并建立了一套关于二者（不）兼容性的理论。
分类不可能性（定理1）：对于满足p > 1的幂律风险调度δ_n = O(n^{-p})，在安全/不安全分布重叠的情况下，任何基于分类器的安全门都通过霍尔德不等式满足TPR_n ≤ C_α * δ_n^β，从而必然导致∑TPR_n < ∞。这一不可能性是指数最优的（定理3）。通过NP计数法（定理4）的第二个独立证明，在不使用霍尔德不等式的情况下得到了一个紧致13%的上界。
通用有限时域上限（定理5）：对于任意可求和的风险调度，分类器可达到的精确最大效用为U*(N, B) = N * TPR_NP(B/N)，其增长率为exp(O(√(log N)))——即亚多项式。当N = 10^6且预算B = 1.0时，分类器最多能提取U* ~ 87的效用，而验证器可达约500,000。
验证逃逸（定理2）：一个利普希茨球验证器能以TPR > 0实现δ = 0，从而逃逸上述不可能性。针对LoRA下预层归一化（pre-LayerNorm）变换器的形式化利普希茨边界，使得大语言模型规模的验证成为可能。该分离是严格的。我们在GPT-2（d_LoRA = 147,456）上进行了验证：条件风险δ = 0时，TPR = 0.352。完整的实证验证见姊妹论文[D2]。

Abstract

Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions – requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) – and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Hölder’s inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder’s inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N * TPR_NP(B/N), growing as exp(O(sqrt(log N))) – subpolynomial. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87 versus a verifier’s ~500,000. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].

Keywords: Safety Verification, Self-Improving Systems, Information-Theoretic Limits, Lipschitz Verification, LoRA, Transformer Models, Risk-Utility Tradeoff, Formal Verification
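
The finite-horizon ceiling U*(N, B) = N * TPR_NP(B/N) quoted in the abstract can be evaluated numerically once a concrete pair of score distributions is fixed. The abstract does not state which distributions underlie TPR_NP, so the sketch below assumes, purely for illustration, that unsafe and safe modifications produce unit-variance Gaussian scores centered at 0 and at a hypothetical separation `d`. Under that assumption the Neyman-Pearson test thresholds the score and TPR = Phi(Phi^{-1}(alpha) + d).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def np_tpr(alpha, d):
    """TPR of the Neyman-Pearson test between N(0, 1) and N(d, 1)
    at false-accept rate alpha (assumed Gaussian model, not from the paper)."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(alpha) + d)


def classifier_ceiling(n_steps, budget, d):
    """U*(N, B) = N * TPR_NP(B / N), with the risk budget B spread evenly over N steps."""
    return n_steps * np_tpr(budget / n_steps, d)


# Assumed separation d = 1.0; horizon and budget taken from the abstract's example.
ceiling_small = classifier_ceiling(10**3, 1.0, d=1.0)
ceiling_large = classifier_ceiling(10**6, 1.0, d=1.0)
```

With the assumed separation d = 1.0, the ceiling at N = 10^6 and B = 1.0 comes out at roughly 87, the same order as the figure quoted in the abstract, although the abstract does not confirm that this is the model used. The qualitative point survives other choices of `d`: multiplying the horizon by a thousand multiplies the attainable utility by far less, which is the sub-polynomial growth the theorem describes.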

39. ❌ The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Authors: Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28643v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

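Rationale: The paper's core contribution is the AIGENIE R package, which uses large language models (LLMs) to generate items for psychological scales automatically, an application of large models to a scientific field (psychology). It is therefore highly relevant to "Large Language Models" (10) and to "AI for Science" (10). The paper mentions support for the APIs of multiple LLM providers, which gives some connection to "Tool Use OR Function Calling OR API Tool Use" (5). Other keywords such as MoE, SFT, and RAG concern technical principles or specific methods of large models that the paper does not address, so all score 0.

!!! tip deepseek-chat TL;DR

The paper presents the AI-GENIE framework, which integrates large language models with network psychometric methods to automate the early stages of psychological scale development, and releases the accompanying AIGENIE R package.

Abstract translation (Chinese)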

心理量表开发传统上需要大量专家参与、迭代修订和大规模预测试，之后才能开始心理测量学评估。AIGENIE R 包实现了 AI-GENIE 框架（基于网络整合评估的自动项目生成），该框架将大语言模型文本生成与网络心理测量学方法相结合，以自动化此过程的早期阶段。该软件包利用大语言模型生成候选项目池，将其转换为高维嵌入，并应用多步骤缩减流程——探索性图分析、唯一变量分析和自助法探索性图分析——以完全在计算机中生成经过结构验证的项目池。本教程通过六个部分介绍该软件包：安装与设置、理解应用程序编程接口、文本生成、项目生成、AIGENIE 函数和 GENIE 函数。两个贯穿始终的示例说明了该软件包的使用：大五人格模型（一个成熟构念）和人工智能焦虑（一个新兴构念）。该软件包支持多个大语言模型提供商（OpenAI、Anthropic、Groq、HuggingFace 及本地模型），提供无需外部 API 调用的完全离线模式，并为希望将心理测量学缩减流程应用于现有项目池（无论其来源如何）的研究者提供 GENIE() 函数。AIGENIE 包可在 R-universe 上免费获取，地址为 https://laralee.r-universe.dev/AIGENIE。

Abstract

Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The AIGENIE R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline – Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA – to produce structurally validated item pools entirely in silico. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the AIGENIE function, and the GENIE function. Two running examples illustrate the package’s use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the GENIE() function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The AIGENIE package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.

Keywords: large language models, psychological scale development, automatic item generation, network psychometrics, R package, AI-GENIE framework, psychometric evaluation, in silico validation

40. ❌ Dynamic Lookahead Distance via Reinforcement Learning-Based Pure Pursuit for Autonomous Racing

Authors: Mohamed Elgouhary, Amr S. El-Wakeel Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

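Rationale: The paper studies path-tracking control for autonomous racing, using reinforcement learning (PPO) to adjust the lookahead distance of a Pure Pursuit controller dynamically. Its content lies entirely within robot control, reinforcement learning, and autonomous driving, and it does not touch on large language models, the technical principles of deep learning, or AI for Science. The paper is unrelated to every keyword, so all relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a reinforcement-learning-based hybrid control framework that uses PPO to adjust the lookahead distance of a Pure Pursuit controller dynamically, achieving better lap times and tracking stability in autonomous racing and transferring from simulation to real hardware.

Abstract translation (Chinese)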

纯追踪（Pure Pursuit, PP）算法因其简洁性和实时性能，在自动驾驶车辆中被广泛用作路径跟踪算法。然而，其有效性对前视距离的选择非常敏感：较短的前视距离能改善转弯性能，但可能在直道上引发不稳定；而较长的前视距离能提高平顺性，却会降低弯道中的跟踪精度。本文提出一种混合控制框架，将近端策略优化（Proximal Policy Optimization, PPO）与经典纯追踪控制器相结合，以在竞速过程中动态调整前视距离。PPO智能体将车辆速度和多视野曲率特征映射为在线前视距离指令。该智能体在F1TENTH Gym仿真器中利用Stable-Baselines3进行训练，采用了KL惩罚和学习率衰减以确保稳定性，随后部署于ROS2环境中以指导控制器。仿真实验将所提方法与固定前视距离的纯追踪算法以及一种自适应纯追踪基线方法进行了比较。额外的实车实验则将学习得到的控制器与固定前视距离纯追踪控制器进行了对比。结果表明，学习得到的策略在未见过的赛道上提升了单圈时间性能和重复完成圈数的能力，同时能够零样本迁移到硬件平台。学习得到的控制器通过在直道上增大前视距离、在弯道中减小前视距离来进行自适应调整，这证明了通过对单一可解释参数进行在线自适应来增强经典控制器的有效性。在未见过的赛道上，所提方法在蒙特利尔赛道取得了33.16秒的成绩，在亚斯码头赛道取得了46.05秒的成绩，同时相比基线方法能容忍更激进的速度曲线缩放，并在所有测试设置中取得了最佳单圈时间。初步的实车实验进一步支持了该算法在1:10比例自动驾驶竞速平台上的仿真到现实迁移能力。

Abstract

Pure Pursuit (PP) is a widely used path-tracking algorithm in autonomous vehicles due to its simplicity and real-time performance. However, its effectiveness is sensitive to the choice of lookahead distance: shorter values improve cornering but can cause instability on straights, while longer values improve smoothness but reduce accuracy in curves. We propose a hybrid control framework that integrates Proximal Policy Optimization (PPO) with the classical Pure Pursuit controller to adjust the lookahead distance dynamically during racing. The PPO agent maps vehicle speed and multi-horizon curvature features to an online lookahead command. It is trained using Stable-Baselines3 in the F1TENTH Gym simulator with a KL penalty and learning-rate decay for stability, then deployed in a ROS2 environment to guide the controller. Experiments in simulation compare the proposed method against both fixed-lookahead Pure Pursuit and an adaptive Pure Pursuit baseline. Additional real-car experiments compare the learned controller against a fixed-lookahead Pure Pursuit controller. Results show that the learned policy improves lap-time performance and repeated lap completion on unseen tracks, while also transferring zero-shot to hardware. The learned controller adapts the lookahead by increasing it on straights and reducing it in curves, demonstrating effectiveness in augmenting a classical controller by online adaptation of a single interpretable parameter. On unseen tracks, the proposed method achieved 33.16 s on Montreal and 46.05 s on Yas Marina, while tolerating more aggressive speed-profile scaling than the baselines and achieving the best lap times among the tested settings. Initial real-car experiments further support sim-to-real transfer on a 1:10-scale autonomous racing platform.

Keywords: Autonomous Racing, Pure Pursuit, Reinforcement Learning, PPO, Lookahead Distance, Path Tracking, Sim-to-Real Transfer, Control Framework
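
To make the role of the lookahead distance concrete, the sketch below computes the classical Pure Pursuit steering angle for a bicycle model, delta = atan(2 L sin(alpha) / l_d), and pairs it with a hand-written lookahead schedule. The schedule is only a stand-in for the learned PPO policy: its functional form and every gain (`l_min`, `l_max`, `k_v`, `k_c`) are hypothetical, as are the example speed, curvature, and wheelbase values.

```python
import math


def pure_pursuit_steering(wheelbase, lookahead, alpha):
    """Steering angle (rad) toward a goal point at distance `lookahead` (m) and
    bearing `alpha` (rad) in the vehicle frame, using the bicycle model."""
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)


def heuristic_lookahead(speed, curvature, l_min=0.5, l_max=3.0, k_v=0.3, k_c=2.0):
    """Hand-tuned stand-in for a learned policy: longer lookahead on fast
    straights, shorter in tight curves. All gains are hypothetical."""
    raw = l_min + k_v * speed - k_c * abs(curvature)
    return min(l_max, max(l_min, raw))


# Same speed, two hypothetical track segments (curvature in 1/m).
lookahead_straight = heuristic_lookahead(speed=6.0, curvature=0.0)
lookahead_corner = heuristic_lookahead(speed=6.0, curvature=0.8)

# Same bearing error, so only the lookahead distance changes the command.
steer_straight = pure_pursuit_steering(wheelbase=0.33, lookahead=lookahead_straight, alpha=0.3)
steer_corner = pure_pursuit_steering(wheelbase=0.33, lookahead=lookahead_corner, alpha=0.3)
```

For the same bearing error, the shorter lookahead produces the larger steering command. That is the trade-off the abstract describes: sharper cornering at the cost of twitchier behavior on straights. In the paper's setup a PPO agent supplies the lookahead online in place of a fixed rule like `heuristic_lookahead`.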

41. ❌ Trust-Aware Routing for Distributed Generative AI Inference at the Edge

Authors: Chanh Nguyen, Erik Elmroth Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

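Rationale: The paper studies G-TRAC, an edge routing and coordination framework for distributed generative AI inference. It focuses on system-level trust management, routing algorithms, and fault tolerance, and does not address the technical principles, training methods, inference optimization, alignment, or applications of large models, so it has no direct connection to any of the given keywords.

!!! tip deepseek-chat TL;DR

To address device unreliability and network dynamics in distributed generative AI inference at the edge, the paper proposes G-TRAC, a trust-aware coordination framework that combines a risk-bounded shortest-path algorithm with a hybrid trust architecture, significantly improving inference completion rates and effectively isolating unreliable nodes.

Abstract translation (Chinese)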

生成式人工智能的新兴部署正日益在去中心化、异构的边缘设备而非单一可信服务器上执行推理。在此类环境中，单个设备故障或行为异常可能中断整个推理过程，使得传统尽力而为的对等路由机制不再适用。因此，协调分布式生成式推理需要明确考虑参与节点间的可靠性、性能差异及信任度的机制。
本文提出G-TRAC，一个集成算法路径选择与系统级协议设计的信任感知协调框架，旨在确保鲁棒的分布式推理。首先，我们将路由问题形式化为一个风险约束最短路径计算问题，并提出一种多项式时间解决方案，该方案结合信任阈值剪枝与迪杰斯特拉搜索算法，在实际边缘规模下实现亚毫秒级的中位路由延迟，并在更大规模下仍保持在10毫秒以内。其次，为在动态环境中操作性地支持路由逻辑，该框架采用一种混合信任架构，该架构在稳定的锚节点处维护全局信誉状态，同时通过后台同步向边缘节点传播轻量级更新。
在商用设备组成的异构测试平台上的实验评估表明，G-TRAC显著提高了推理完成率，有效隔离了不可靠节点，并在节点故障与网络分区情况下仍能维持鲁棒的推理执行。

Abstract

Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a Risk-Bounded Shortest Path computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra’s search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a Hybrid Trust Architecture that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.

Keywords: Generative AI, distributed inference, edge computing, trust-aware routing, risk-bounded shortest path, hybrid trust architecture, fault tolerance, peer-to-peer coordination
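
The routing idea named in the abstract, trust-floor pruning followed by Dijkstra's search, can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than G-TRAC's actual algorithm: the per-node trust scores, the single `trust_floor` threshold, the latency-weighted graph, and all node names are hypothetical, and the paper's risk bound may be defined differently (for example over whole paths).

```python
import heapq


def trusted_shortest_path(graph, trust, source, target, trust_floor):
    """graph: {node: {neighbor: latency_ms}}; trust: {node: score in [0, 1]}.
    Returns (path, total_latency), or (None, inf) if no trusted path exists."""
    # Step 1: trust-floor pruning removes every node below the threshold.
    allowed = {n for n in graph if trust.get(n, 0.0) >= trust_floor}
    if source not in allowed or target not in allowed:
        return None, float("inf")

    # Step 2: plain Dijkstra restricted to the surviving nodes.
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        if node == target:
            break
        for nxt, cost in graph[node].items():
            if nxt not in allowed:
                continue  # pruned peer
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))

    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical four-peer topology: the fast relay is the least trusted one.
graph = {
    "src": {"fast": 1.0, "slow": 3.0},
    "fast": {"dst": 1.0},
    "slow": {"dst": 3.0},
    "dst": {},
}
trust = {"src": 1.0, "fast": 0.4, "slow": 0.9, "dst": 1.0}

strict_path, strict_cost = trusted_shortest_path(graph, trust, "src", "dst", trust_floor=0.5)
lax_path, lax_cost = trusted_shortest_path(graph, trust, "src", "dst", trust_floor=0.0)
```

Raising the floor reroutes traffic through the slower but more trusted peer, and with a floor above every relay's score the function reports that no route exists. Pruning before the search keeps the whole computation polynomial, which matches the complexity claim in the abstract.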

42. ❌ Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28618v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

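Rationale: The paper focuses on improving the reasoning of multimodal large language models (MLLMs) and proposes the PRCO framework, which optimizes perception and reasoning through role separation. Core relevant keywords: 1) "Large Language Models" (10): the paper explicitly studies MLLMs, which fall under large models; 2) "Chain of Thought" and "System 2 Thinking" (10 each): the paper involves multi-step and in-depth reasoning, which relate closely to chain-of-thought and System 2 thinking. Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the coupled optimization of perception and reasoning in multimodal large language models, the paper proposes the PRCO framework, which separates Observer and Solver roles and uses role-specific reward signals, improving average accuracy by more than 7 points across eight benchmarks.

Abstract translation (Chinese)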

可验证奖励强化学习（RLVR）显著提升了多模态大语言模型（MLLMs）的推理能力。然而，现有的RLVR方法通常依赖于结果驱动的优化策略，即仅依据最终答案的共享奖励来同时更新感知与推理模块。这种共享奖励机制模糊了贡献分配，虽常能改进推理模式，却未能可靠提升上游视觉证据提取的准确性。为突破这一感知瓶颈，我们提出了PRCO（感知-推理协同进化），一种采用共享策略的双角色RLVR框架。PRCO包含两个协同角色：针对问题生成定制化证据描述的观察者（Observer），以及基于该描述预测最终答案的求解者（Solver）。关键之处在于，PRCO采用了角色特定的奖励信号：求解者通过最终答案的可验证结果奖励进行优化，而观察者则获得源自求解者下游成功效用的奖励。在八个具有挑战性的多模态推理基准上进行的大量实验表明，与基础模型相比，PRCO在不同模型规模上均实现了超过7个百分点的平均准确率提升，且性能优于先前开源的基于RL调优的基线方法。

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver’s downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.

Keywords: Multimodal Large Language Models, Reinforcement Learning with Verifiable Rewards, Perception-Reasoning Coevolution, Dual-role Framework, Evidence Caption Generation, Multimodal Reasoning, Role-specific Reward Signals, Benchmark Evaluation
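
The role-specific reward split described in the abstract can be illustrated with two small functions. This is a sketch under stated assumptions, not PRCO's actual reward definition: the abstract only says the Observer's reward is derived from the Solver's downstream success, so the exact-match check and the use of a mean success rate over several Solver answers are hypothetical choices.

```python
def solver_reward(predicted, reference):
    """Verifiable outcome reward: 1.0 if the final answer matches the reference
    (case- and whitespace-insensitive exact match, an assumed check), else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def observer_reward(solver_answers, reference):
    """Utility of one evidence caption: the Solver's mean success rate over the
    answers it produced when conditioned on that caption (assumed form)."""
    if not solver_answers:
        return 0.0
    return sum(solver_reward(a, reference) for a in solver_answers) / len(solver_answers)


# Hypothetical rollout: three Solver answers sampled from a single caption.
answers_from_caption = ["42", " 42 ", "41"]
caption_utility = observer_reward(answers_from_caption, reference="42")
per_answer_rewards = [solver_reward(a, "42") for a in answers_from_caption]
```

Separating the two signals means a caption is credited for how useful it is to the Solver rather than for the final answer directly, which is how the abstract describes addressing the blurred credit assignment of a single shared reward.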

43. ❌ TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark

Authors: Hannes Mareen, Dimitrios Karageorgiou, Paschalis Giakoumoglou, Peter Lambert, Symeon Papadopoulos, Glenn Van Wallendael Venue/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
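
For reference, the pass mark shown in each entry follows from the formula in the report's configuration header (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The sketch below reproduces that number. Treating a paper's total as the sum of the per-keyword score column is an assumption, as are the function names.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report header: base + factor * total weight."""
    return base + factor * total_weight

def passes(row_scores: list[float], total_weight: float) -> bool:
    """Assumed rule: a paper passes when its summed keyword scores reach the pass mark."""
    return sum(row_scores) >= pass_mark(total_weight)

threshold = pass_mark(27.0)      # 27 keywords, weight 1.0 each; about 26.6
tgif2_rows = [0.0] * 27          # score column of the table above
tgif2_passes = passes(tgif2_rows, 27.0)
```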

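Scoring rationale: The paper belongs to image forensics: it studies detection and localization of text-guided inpainting forgeries, building a dataset with generative models such as FLUX.1 and benchmarking against it. It is unrelated to every scoring keyword (all of which concern large language models, deep learning techniques, or AI for Science), since it involves no language models, training or fine-tuning techniques, reasoning methods, agent systems, model compression, or scientific AI applications. It is a computer vision and multimedia forensics paper rather than research on large models or their underlying techniques.

!!! tip deepseek-chat TL;DR

The paper addresses detection of text-guided inpainting forgeries by extending the TGIF dataset into TGIF2, and evaluates how existing forensic methods degrade on newer generative models (such as FLUX.1) and under image enhancement operations.

Abstract (Chinese translation)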

生成式人工智能使文本引导的图像修复成为强大的图像编辑工具，但同时也对媒体取证构成了日益严峻的挑战。现有基准测试（包括我们提出的文本引导修复伪造数据集TGIF）表明，图像伪造定位方法能够定位拼接图像中的篡改区域，但在处理完全重生成图像时表现不佳；而合成图像检测方法虽能识别完全重生成图像，却无法实现定位。随着新型生成式修复模型不断涌现，且完全重生成图像的定位问题尚未解决，我们需要更新的数据集和基准。本文推出TGIF的扩展版本TGIF2，该数据集涵盖了文本引导修复技术的最新进展，支持对取证鲁棒性进行更深入的分析。TGIF2在原始数据集基础上，新增了由FLUX.1模型生成的编辑图像以及随机非语义掩码。基于TGIF2数据集，我们开展了涵盖图像伪造定位与合成图像检测的取证评估，包括对完全重生成图像进行定位方法的微调以及生成式超分辨率攻击实验。实验表明，现有图像伪造定位与合成图像检测方法在应对FLUX.1模型生成的篡改时性能均出现下降，凸显了其泛化能力的局限性。此外，尽管微调提升了针对完全重生成图像的定位能力，但随机非语义掩码的评估结果揭示了方法存在的对象偏差问题。同时，生成式超分辨率处理会显著削弱取证痕迹，这证明常见的图像增强操作可能破坏当前取证流程的可靠性。总之，TGIF2提供了更新的数据集与基准，为理解现代修复技术和基于人工智能的图像增强所带来的挑战提供了新的视角。TGIF2数据集可通过https://github.com/IDLabMedia/tgif-dataset获取。

Abstract

Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.

Keywords: text-guided inpainting, image forgery localization, synthetic image detection, generative AI, forensic benchmark, FLUX.1, generative super-resolution, dataset extension

44. ❌ ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Authors: Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, Kang Liu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

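Scoring rationale: The paper proposes ResAdapt, a framework that improves the efficiency of multimodal large language models (MLLMs) through input-side adaptive resolution; it centers on large models and reasoning tasks. It is highly relevant to 'Large Language Models' (10) because it focuses on MLLMs. It relates to 'Context Window Extension' (5) because it tackles long temporal context, to 'Chain of Thought' (5) because it is applied to reasoning-intensive tasks, to 'Quantization' (5) because it involves compression and efficiency optimization, and to 'Speculative Decoding' (5) because it targets inference acceleration. Other keywords such as MoE, SLMs, and alignment are not covered by the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes ResAdapt, a framework that adapts input resolution to relieve the efficiency bottleneck MLLMs face when sustaining both high spatial resolution and long temporal context, supporting more frames and better performance at the same visual budget.

Abstract (Chinese translation)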

多模态大语言模型（MLLMs）通过提升输入保真度实现了更强的视觉理解能力，但由此产生的视觉标记增长使得同时维持高空间分辨率与长时序上下文变得难以实现。我们认为瓶颈并非在于编码后表示的压缩方式，而在于编码器接收的像素总量，并为此提出ResAdapt——一种输入侧自适应框架，该框架在编码前学习每帧图像应分配多少视觉预算。ResAdapt将轻量级分配器与未经修改的MLLM主干网络耦合，使主干网络在接收经操作符变换的输入时，仍能保持其原有的视觉标记接口。我们将资源分配建模为上下文赌博机问题，并通过成本感知策略优化（Cost-Aware Policy Optimization, CAPO）训练分配器，该方法将稀疏的 rollout 反馈转化为稳定的精度-成本学习信号。在预算受控的视频问答、时序定位和图像推理任务中，ResAdapt提升了低预算工作点的性能，并常位于或接近效率-精度前沿，在激进压缩下的推理密集型基准测试中提升最为显著。值得注意的是，在相同视觉预算下，ResAdapt可支持多达16倍的帧数，同时带来超过15%的性能提升。代码发布于https://github.com/Xnhyacinth/ResAdapt。

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.

Keywords: Multimodal Large Language Models, Adaptive Resolution, Efficient Reasoning, Visual Token Compression, Input-side Adaptation, Video QA, Temporal Grounding, Cost-Aware Policy Optimization
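
The "16x more frames at the same visual budget" figure in the ResAdapt abstract is easier to picture with a back-of-the-envelope calculation. The sketch below assumes a patch-based encoder whose token count is proportional to pixel count (one token per 14×14 patch). The patch size, resolutions, and budget are illustrative values that do not come from the paper, and the real Allocator chooses per-frame budgets adaptively rather than uniformly.

```python
def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    """Visual tokens for one frame under an assumed patch-based encoder."""
    return (height // patch) * (width // patch)

def frames_within_budget(budget_tokens: int, height: int, width: int) -> int:
    """How many equally sized frames fit into a fixed visual-token budget."""
    return budget_tokens // tokens_per_frame(height, width)

budget = 8 * tokens_per_frame(448, 448)            # budget sized for 8 full-resolution frames
full_res = frames_within_budget(budget, 448, 448)  # 448/14 = 32 -> 1024 tokens per frame
quarter = frames_within_budget(budget, 112, 112)   # 112/14 = 8 -> 64 tokens per frame
```

Shrinking each side by 4× cuts pixels (and, under this assumption, tokens) by 16×, which is one way a 16× frame count at equal budget could arise.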

45. ❌ Detection of Adversarial Attacks in Robotic Perception

Authors: Ziad Sharawy, Mohammad Nakshbandi, Sorin Mihai Grigorescu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

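Scoring rationale: The paper studies detection of adversarial attacks in robotic perception, focusing on the robustness of deep neural networks for semantic segmentation. It does not involve large models, new deep learning techniques, or scientific applications, and has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper studies detection of adversarial attacks on DNN-based semantic segmentation in robotic perception and proposes detection strategies tailored to robotic settings.

Abstract (Chinese translation)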

深度神经网络（Deep Neural Networks, DNNs）在机器人感知的语义分割任务中表现出强大性能，但其仍易受对抗性攻击的威胁，这对安全关键型应用构成了风险。尽管针对图像分类的鲁棒性已有较多研究，但机器人场景下的语义分割需要专门的网络架构与检测策略。

Abstract

Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.

Keywords: Adversarial Attacks, Robotic Perception, Deep Neural Networks, Semantic Segmentation, Robustness, Detection Strategies, Safety-critical Applications

46. ❌ Towards a Medical AI Scientist

Authors: Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28589v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

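Scoring rationale: The paper proposes the Medical AI Scientist framework, an application of AI for Science in medicine, which is highly relevant (10). The system is an autonomous research framework and so falls under LLM Agents / Autonomous Agents (10). Its clinician-engineer co-reasoning mechanism involves multi-step reasoning and deliberate thinking, relating to Chain of Thought and System 2 Thinking (8 each). The paper uses LLMs for evaluation, relating to Large Language Models (10). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Medical AI Scientist, described as the first framework for autonomous clinical research, which uses a clinician-engineer co-reasoning mechanism to generate research ideas and manuscripts. In an evaluation over 171 cases its ideas were rated substantially higher in quality than those of commercial LLMs, and its manuscripts approach MICCAI-level quality.

Abstract (Chinese translation)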

能够自主生成科学假设、开展实验并撰写论文的自主系统，近期已成为加速科学发现的新兴范式。然而，现有的“AI科学家”大多仍与具体领域无关，这限制了其在临床医学中的应用，因为该领域的研究需要以医学证据为基础，并涉及专业化的数据模态。本研究提出了“医学AI科学家”——首个专为临床自主研究定制的自动化研究框架。该框架通过临床医生与工程师协同推理机制，将广泛调研的文献转化为可操作的证据，从而实现基于临床依据的构思，并提升所生成研究思路的可追溯性。此外，它还能在结构化医学写作规范与伦理政策的指导下，完成基于证据的论文草拟。该框架支持三种研究模式：基于论文的复现、文献启发的创新以及任务驱动的探索，分别对应自动化科学探究的不同层级，其自主性逐级提升。基于大语言模型与人类专家的综合评估表明，在171个案例、19项临床任务和6种数据模态中，“医学AI科学家”所生成的研究思路质量显著优于商用大语言模型。同时，本系统实现了所提方法与其执行之间的高度一致性，并在可执行实验中展现出显著更高的成功率。人类专家与斯坦福Agentic评审员的双盲评估显示，该系统生成的论文稿件质量接近MICCAI会议水平，且一致优于ISBI与BIBM会议的标准。所提出的“医学AI科学家”彰显了利用人工智能在医疗健康领域实现自主科学发现的潜力。

Abstract

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical autonomous research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

Keywords: Medical AI Scientist, autonomous research framework, clinical medicine, clinician-engineer co-reasoning, evidence-grounded manuscript drafting, AI for Science, autonomous agents, scientific discovery in healthcare

47. ❌ Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Authors: Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen, Yanna Lin, Huamin Qu, Lei Chen, Yushi Sun Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

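Scoring rationale: The paper proposes ChartCynics, a framework for misleading-chart question answering whose core contribution is a dual-path agentic workflow that coordinates a diagnostic vision path with an OCR-driven data path; it is therefore highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10). The framework uses a two-stage optimization protocol that includes Oracle-Informed SFT, so it relates to 'Post-training OR Supervised Fine-tuning OR SFT' (8). Its skeptical reasoning paradigm involves multi-step, deliberate reasoning, relating to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 each). Its goal of more robust and trustworthy chart interpretation relates to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8). Because it builds on models such as Qwen3-VL-8B, it has some connection to 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' (5 each). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not covered and receive 0.

!!! tip deepseek-chat TL;DR

To tackle misleading-chart question answering, the paper proposes ChartCynics, a dual-path agentic framework that decouples perception from verification, introduces a skeptical reasoning paradigm, and uses a two-stage optimization protocol. It markedly improves the robustness and accuracy of a small open-source model on chart interpretation, with an absolute gain of about 29% on the benchmarks.

Abstract (Chinese translation)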

尽管视觉语言模型（VLMs）已取得显著成功，但误导性图表因其欺骗性的视觉结构与扭曲的数据呈现，仍构成重大挑战。我们提出了ChartCynics，一种基于“怀疑式”推理范式的智能体双路径框架，旨在揭示视觉欺骗。与整体性模型不同，ChartCynics将感知与验证解耦：诊断视觉路径通过策略性感兴趣区域（ROI）裁剪捕捉结构异常（如倒置坐标轴），而光学字符识别（OCR）驱动的数据路径则确保数值的准确锚定。为解决跨模态冲突，我们引入了一种智能体摘要器，该模块通过两阶段协议优化：基于Oracle信息的监督微调（SFT）用于推理能力蒸馏，以及欺骗感知的组相对策略优化（GRPO）用于对抗性对齐。此流程能有效惩罚视觉陷阱并强化逻辑一致性。在两个基准测试上的评估表明，ChartCynics分别达到74.43%和64.55%的准确率，相比Qwen3-VL-8B骨干模型实现了约29%的绝对性能提升，并超越了当前最先进的专有模型。我们的研究证明，专业化的智能体工作流程能使较小的开源模型获得更强的鲁棒性，为可信赖的图表解读奠定了新基础。

Abstract

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a “skeptical” reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.

Keywords: misleading charts, agentic framework, dual-path, vision-language models, skeptical reasoning, chart interpretation, robustness, Qwen3-VL-8B

48. ❌ ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning

Authors: Mohamad Koohi-Moghadam, Hongzhe Sun, Hongyan Li, Kyongtae Tyler Bae Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

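Scoring rationale: The paper belongs to cheminformatics: it uses the contrastive learning framework ChemCLIP to bridge representation learning for organic and inorganic anticancer compounds, an application of AI in science. It is unrelated to most keywords (which mainly concern large-model techniques, training methods, and inference optimization) and is highly relevant only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because its core is the application of AI to cheminformatics for cross-domain knowledge transfer in anticancer drug discovery.

!!! tip deepseek-chat TL;DR

Organic and inorganic anticancer compounds have traditionally been treated as separate chemical domains, which limits knowledge transfer. The paper proposes ChemCLIP, a contrastive learning framework that learns unified representations from shared anticancer activity rather than structural similarity, maps structurally distinct compounds into a shared embedding space, and compares several molecular encoding strategies.

Abstract (Chinese translation)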

传统抗癌药物研发领域通常将有机小分子与金属基配位复合物视为独立的化学领域，尽管它们具有共同的生物学目标，这种划分限制了知识迁移。这种差异在现有数据中尤为明显——有机化合物拥有大量筛选数据库，而已表征的金属配合物仅数千种。本研究提出ChemCLIP，一种双编码器对比学习框架，通过基于共享抗癌活性（而非结构相似性）学习统一表征，从而弥合有机与无机化学之间的鸿沟。我们整合了包含44,854个独特有机化合物和5,164个独特金属配合物的互补数据集，并在60种癌细胞系中进行了标准化处理。通过采用活性感知的困难负样本挖掘技术训练并行编码器，我们将结构迥异的化合物映射到共享的256维嵌入空间，在此空间中具有生物相似性的化合物能聚集在一起，而不受化学类别限制。我们通过定量对齐指标、嵌入可视化和下游分类任务，系统评估了四种分子编码策略：摩根指纹（Morgan fingerprints）、ChemBERTa、MolFormer和Chemprop。摩根指纹表现出最优性能，平均对齐比为0.899，下游分类AUC值分别达到0.859（无机类）和0.817（有机类）。本研究表明对比学习是统一不同化学领域的有效策略，为多模态化学应用中的编码器选择提供了实证指导，其意义不仅限于抗癌药物发现，更可拓展至任何需要跨领域化学知识迁移的研究场景。

Abstract

The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is particularly pronounced in available data, with extensive screening databases for organic compounds compared to only a few thousand characterized metal complexes. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activities rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space where biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies: Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop, through quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.

Keywords: ChemCLIP, contrastive learning, anticancer compounds, organic-inorganic divide, molecular representation, cheminformatics, cross-domain knowledge transfer, drug discovery
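
The ChemCLIP abstract describes a dual-encoder contrastive setup but does not state the loss. The sketch below uses a CLIP-style symmetric InfoNCE loss as a stand-in, which is an assumption suggested only by the framework's name. It also assumes that row i of each batch forms an activity-matched positive pair, uses toy 2-dimensional vectors instead of the 256-dimensional embeddings, and omits the activity-aware hard negative mining. All function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_style_loss(org_emb, inorg_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where org_emb[i] and inorg_emb[i]
    are treated as a positive pair (an assumption for this sketch)."""
    a = [l2_normalize(v) for v in org_emb]
    b = [l2_normalize(v) for v in inorg_emb]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))

organic = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(organic, [[2.0, 0.1], [0.1, 3.0]])
swapped = clip_style_loss(organic, [[0.1, 3.0], [2.0, 0.1]])
```

When matched pairs point in similar directions the loss is close to zero, and it grows when the pairing is swapped, which is the behavior that pulls biologically similar compounds together regardless of chemical class.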

49. ❌ Learning Partial Action Replacement in Offline MARL

Authors: Yue Jin, Giovanni Montana Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

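Scoring rationale: The paper addresses partial action replacement in offline multi-agent reinforcement learning (MARL) and proposes the PLCQL framework. The scoring keywords concern large models, deep learning techniques, or AI for Science, whereas this work optimizes a conventional reinforcement learning algorithm and does not involve large models, deep learning architectures, training methods, inference optimization, alignment techniques, LLM-based agent systems, or scientific AI applications. The only relevant keyword is 'Multi-agent Systems OR Agent Coordination', since the paper explicitly studies multi-agent systems, though not LLM agents or their coordination. All other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The paper proposes PLCQL, which casts the choice of partial-action-replacement subsets as a contextual bandit problem to address the sparse dataset coverage caused by the exponentially growing joint action space in offline MARL. It achieves higher normalised scores on multiple benchmarks at substantially lower computational cost.

Abstract (Chinese translation)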

离线多智能体强化学习（MARL）面临一个关键挑战：联合动作空间随智能体数量呈指数级增长，导致数据集覆盖度指数级稀疏，且分布外（OOD）联合动作不可避免。部分动作替换（PAR）通过将部分智能体的动作锚定在数据集动作中来缓解此问题，但现有方法依赖枚举多种子集配置，计算成本高昂，且无法适应不同状态。我们提出PLCQL框架，将PAR子集选择建模为上下文赌博机问题，并利用近端策略优化与不确定性加权奖励，学习一种状态依赖的PAR策略。该自适应策略动态决定每个更新步骤中替换多少智能体，在策略改进与保守值估计之间取得平衡。我们证明了一个值误差界，表明估计误差与偏离智能体的期望数量呈线性关系。与基于PAR的先前方法SPaCQL相比，PLCQL将每次迭代的Q函数评估次数从n减少至1，显著提升了计算效率。实验表明，在MPE、MaMuJoCo和SMAC基准测试中，PLCQL在66%的任务上取得了最高的归一化分数，在84%的任务上优于SPaCQL，同时大幅降低了计算成本。

Abstract

Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.

Keywords: Offline Multi-agent Reinforcement Learning, Partial Action Replacement, Contextual Bandit, Proximal Policy Optimization, Computational Efficiency, Value-error Bound, Dataset Coverage, Joint Action Space
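
Partial Action Replacement, as described in the PLCQL abstract, keeps some agents anchored to dataset actions while the rest follow the current policy. The sketch below shows only that mixing step. The function name, the random choice of which agents to replace, and the toy action labels are assumptions; in PLCQL the number of replaced agents is chosen per state by a learned contextual-bandit policy, which is not modeled here.

```python
import random

def partial_action_replacement(dataset_actions, policy_actions, k, rng):
    """Return a joint action in which k randomly chosen agents follow the
    current policy and the remaining agents stay anchored to dataset actions."""
    n = len(dataset_actions)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of agents")
    replaced = set(rng.sample(range(n), k))
    joint = [policy_actions[i] if i in replaced else dataset_actions[i]
             for i in range(n)]
    return joint, replaced

rng = random.Random(0)
data_a = ["left", "stay", "right", "stay"]   # joint action stored in the dataset
policy_a = ["up", "up", "up", "up"]          # actions proposed by the learned policies
joint, replaced = partial_action_replacement(data_a, policy_a, k=1, rng=rng)
```

With k = 0 the joint action is fully in-distribution, and with k = n it is fully policy-driven; intermediate values trade policy improvement against how far the joint action deviates from the dataset.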

50. ❌ CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

Authors: Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

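Scoring rationale: The paper's core topic is the evaluation of LLM-based agents in real-world cloud service environments, which is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). It involves multi-turn reasoning and complex task handling, giving it some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each), and the cloud service setting involves tool use, giving it some relevance to 'Tool Use' (5 points). Other keywords, such as MoE, SLMs, training methods, optimization techniques, and AI-for-science applications, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

This paper addresses the shortcomings of existing LLM agent benchmarks in real-world cloud service settings by proposing CirrusBench, an evaluation framework built on real service-ticket data. It introduces customer-centric metrics that measure resolution efficiency, and its experiments find that current state-of-the-art models fall short on efficiency in complex multi-turn tasks, pointing to directions for practical technical-service applications.

Abstract (Chinese translation)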

大型语言模型（LLM）日益增强的自主能力使其能够在真实世界应用中得到部署，例如在云服务场景中，客户与助手之间的交互呈现出高技术复杂性和长程依赖性，这使得模型的鲁棒性和解决效率对客户满意度至关重要。然而，现有基于LLM的智能体评估基准主要依赖于合成环境，无法捕捉真实客户输入的多样性和不可预测性，且往往忽略了实际部署中至关重要的解决效率。为弥补这一差距，我们提出了CirrusBench，这是一个新颖的评估框架，其核心特点在于基于真实云服务工单的实际数据构建。CirrusBench保留了技术服务环境中固有的复杂多轮逻辑链和真实的工具依赖性。除了执行正确性之外，我们引入了新颖的以客户为中心的指标来定义智能体的成功，通过标准化效率指数和多轮延迟等指标量化服务质量，从而明确衡量解决效率。利用本框架进行的实验表明，尽管最先进的模型展现出强大的推理能力，但它们常在复杂、真实的多轮任务中表现不佳，且难以达到客户服务所需的高效标准，这凸显了基于LLM的智能体在实际技术服务应用未来发展的关键方向。CirrusBench评估框架已发布于：https://github.com/CirrusAI

Abstract

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: https://github.com/CirrusAI

Keywords: LLM-based agents, cloud service environments, evaluation framework, real-world data, multi-turn tasks, resolution efficiency, Customer-Centric metrics, technical service applications

51. ❌ T-Norm Operators for EU AI Act Compliance Classification: An Empirical Comparison of Lukasiewicz, Product, and Gödel Semantics in a Neuro-Symbolic Reasoning System

Authors: Adam Laabs Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28558v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

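Scoring rationale: The paper compares t-norm operators in a neuro-symbolic reasoning system for EU AI Act compliance classification. This is an applied AI topic, but its core content has no direct connection to the keywords on deep learning, large-model techniques, or AI-for-science applications. It does not address LLMs, MoE, SLMs, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, inference optimization, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, or in-context learning, nor scientific AI applications such as bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

This study compares three t-norm operators (Lukasiewicz, Product, and Gödel) in a neuro-symbolic reasoning system for EU AI Act compliance classification. It finds that the Gödel operator achieves the highest accuracy but produces false positives, while the Product operator outperforms Lukasiewicz, with both at zero false positives.

Abstract (Chinese translation)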

本文首次对三种t-范数算子——卢卡西维茨（T_L）、乘积（T_P）与哥德尔（T_G）——作为神经符号推理系统中的逻辑合取机制，在欧盟《人工智能法案》合规分类任务中进行了比较性试点研究。利用LGGT+（逻辑引导图变换器增强版）引擎及包含1035条标注数据（涵盖禁止、高风险、有限风险、最小风险四类）的基准测试集，我们评估了分类准确率、假阳性与假阴性率，以及各算子在模糊案例上的行为表现。在n=1035的规模下，三种算子表现差异显著（McNemar检验p<0.001）。T_G获得最高准确率（84.5%）和最优边界案例召回率（85%），但其最小语义机制导致8例假阳性（0.8%）。T_L与T_P均保持零假阳性，其中T_P表现优于T_L（81.2%对78.5%）。我们的核心发现是：（1）算子选择的重要性次于规则库的完备性；（2）T_L与T_P能维持零假阳性，但会遗漏边界案例；（3）T_G的最小语义机制以0.8%的假阳性率为代价实现了更高召回率；（4）混合语义分类器是下一步的有效研究方向。我们已基于Apache 2.0协议开源LGGT+核心引擎（通过201/201项测试）及基准数据集（n=1035）。

Abstract

We present a first comparative pilot study of three t-norm operators – Lukasiewicz (T_L), Product (T_P), and Gödel (T_G) – as logical conjunction mechanisms in a neuro-symbolic reasoning system for EU AI Act compliance classification. Using the LGGT+ (Logic-Guided Graph Transformers Plus) engine and a benchmark of 1035 annotated AI system descriptions spanning four risk categories (prohibited, high_risk, limited_risk, minimal_risk), we evaluate classification accuracy, false positive and false negative rates, and operator behaviour on ambiguous cases. At n=1035, all three operators differ significantly (McNemar p<0.001). T_G achieves the highest accuracy (84.5%) and the best borderline recall (85%), but introduces 8 false positives (0.8%) via min-semantics over-classification. T_L and T_P maintain zero false positives, with T_P outperforming T_L (81.2% vs. 78.5%). Our principal findings are: (1) operator choice is secondary to rule base completeness; (2) T_L and T_P maintain zero false positives but miss borderline cases; (3) T_G’s min-semantics achieves higher recall at the cost of a 0.8% false positive rate; (4) a mixed-semantics classifier is the productive next step. We release the LGGT+ core engine (201/201 tests passing) and benchmark dataset (n=1035) under Apache 2.0.

Keywords: t-norm operators, neuro-symbolic reasoning, EU AI Act compliance, classification, Lukasiewicz, Product, Gödel, LGGT+
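
The three conjunction operators compared in this abstract have standard textbook definitions, so their differing strictness is easy to see on a small example. The sketch below is illustrative only: the `conjoin` helper, the premise degrees, and the 0.3 firing threshold are hypothetical and are not taken from the LGGT+ engine.

```python
from functools import reduce
from typing import Callable, Iterable

TNorm = Callable[[float, float], float]


def t_lukasiewicz(a: float, b: float) -> float:
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def t_product(a: float, b: float) -> float:
    """Product t-norm: a * b."""
    return a * b


def t_godel(a: float, b: float) -> float:
    """Goedel (minimum) t-norm: min(a, b)."""
    return min(a, b)


def conjoin(tnorm: TNorm, degrees: Iterable[float]) -> float:
    """Fold a t-norm over truth degrees; 1.0 is the neutral element."""
    return reduce(tnorm, degrees, 1.0)


# Hypothetical truth degrees for the three premises of one rule.
premises = [0.5, 0.5, 0.5]
THRESHOLD = 0.3  # hypothetical firing threshold, not from the paper

scores = {
    "lukasiewicz": conjoin(t_lukasiewicz, premises),
    "product": conjoin(t_product, premises),
    "godel": conjoin(t_godel, premises),
}
fired = {name: score >= THRESHOLD for name, score in scores.items()}

for name, score in scores.items():
    print(f"{name:12s} degree={score:.3f} fires={fired[name]}")
```

For the same premises the Gödel conjunction is the most permissive and the Lukasiewicz conjunction the strictest (T_L ≤ T_P ≤ T_G holds for any inputs in [0, 1]). That ordering is consistent with the abstract's observation that T_G gains recall at the cost of some false positives.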

52. ❌ Domain-Invariant Prompt Learning for Vision-Language Models

Authors: Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

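Scoring rationale: The paper studies prompt learning for vision-language models such as CLIP. It belongs to computer vision rather than to innovations in large language models (LLMs) or core deep learning techniques. Its main concern is domain generalization for vision-language models, which is unrelated to most keywords. It has some relevance only to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it mentions a pre-trained model (CLIP) and domain adaptation (domain generalization); its core, however, is prompt learning rather than pre-training or domain adaptation as such. All other keywords are irrelevant (0 points).

!!! tip deepseek-chat TL;DR

This paper addresses prompt learning for vision-language models in domain generalization tasks and proposes DiCoOp, an adversarial training method. Experiments show that it outperforms CoOp across multiple visual domains.

Abstract (Chinese translation)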

以CLIP为代表的大规模预训练视觉-语言模型通过将图像与文本对齐至共享特征空间，借助提示机制实现了强大的零样本迁移能力，从而改变了计算机视觉领域的发展范式。软提示方法（如上下文优化CoOp）通过学习一组上下文向量，能有效适配此类模型以完成下游识别任务。然而，CoOp缺乏明确的机制来处理未见数据分布间的域偏移问题。为此，我们提出领域不变上下文优化（Domain-invariant Context Optimization, DiCoOp），这是一种针对领域泛化优化的CoOp扩展方法。通过采用对抗性训练策略，DiCoOp迫使模型在保持分类判别力的同时学习领域不变的提示向量。实验结果表明，在跨多种视觉领域的泛化任务中，DiCoOp始终优于CoOp方法。

Abstract

Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.

Keywords: vision-language models, CLIP, prompt learning, domain generalization, adversarial training, Context Optimization, DiCoOp, zero-shot transfer

53. ❌ Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Authors: Athos Georgiou Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

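Scoring rationale: The paper's core contribution is Hydra, which unifies document retrieval and generation in a single vision-language model (VLM). Keyword relevance: (1) Highly relevant (8-10 points): 'PEFT/LoRA' is the core technique (a LoRA adapter is used); 'Retrieval-Augmented Generation' relates to the unification of retrieval and generation; 'Large Language Models' is the underlying model type. (2) Moderately relevant (5 points): 'KV Cache Compression', which is mentioned in connection with decoding. (3) Irrelevant (0 points): other keywords such as MoE, SLMs, Scaling Laws, and Alignment are not addressed. The paper focuses on engineering a model architecture rather than on broader topics such as large-model training, alignment, or inference acceleration.

!!! tip deepseek-chat TL;DR

This paper tackles the system complexity that arises in visual document understanding from needing separate retrieval and generation models. It proposes Hydra, which unifies retrieval and generation in a single vision-language model with a toggleable LoRA adapter, reducing peak GPU memory by 41% while preserving generation quality.

Abstract (Chinese translation)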

视觉文档理解通常需要独立的检索与生成模型，这导致内存占用和系统复杂度成倍增加。我们提出Hydra，一种双头架构方法，能够从单一视觉语言模型中同时支持ColBERT风格的延迟交互检索与自回归生成。仅针对检索任务训练的一个LoRA适配器在推理时可切换：启用时产生多向量嵌入；禁用时则恢复基础模型的生成质量——在与独立基础模型流程对比时，10,500个贪婪解码和随机采样样本中100%实现字节级一致输出，在四个VQA基准测试（其中三个为信息型任务；ChartQA任务在贪婪解码下两种模型得分均接近零）的15,301个样本中最大delta-ANLS差值仅为0.0044。我们明确了三项工程要求（注意力模式恢复、lm_head保留、KV缓存感知解码），若忽略这些要求，即使权重恢复正确，也会无声地破坏生成能力。在ViDoRe V1基准上，Hydra（4B参数）在单次训练中与受控单头基线差距在1个百分点内，在V2和V3版本上虽集中于部分任务但获得更高的综合分数；需通过多随机种子实验验证这些趋势。单模型设计将GPU峰值内存降低41%，但适配器切换在并发服务负载下会引入吞吐量开销。消融实验表明，在基于LoRA（r=16）的训练框架中，GritLM风格的联合训练未带来增益。对Qwen2.5-Omni-3B的概念验证扩展表明，该机制可泛化至音频检索与视频嵌入任务，并支持语音生成。

Abstract

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model’s generation quality – byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.

Keywords: Vision-Language Model, Document Retrieval, Document Generation, LoRA, Parameter-efficient Fine-tuning, KV-cache, Unified Model, Multi-modal AI
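
The adapter toggle described in this abstract can be sketched with a single linear layer. This is a minimal pure-Python illustration of the generic LoRA update y = Wx + s·B(Ax), not Hydra's implementation. The class name, the `adapter_enabled` flag, and the toy matrices are hypothetical, and the three engineering requirements the abstract lists (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) are not modeled.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight plus a low-rank update that can be switched off."""

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix,
                 scale: float = 1.0) -> None:
        self.weight = weight      # frozen base weight, shape (out, in)
        self.lora_a = lora_a      # down-projection, shape (r, in)
        self.lora_b = lora_b      # up-projection, shape (out, r)
        self.scale = scale
        self.adapter_enabled = False

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            # Adapter off: the base weights are never modified, so this
            # path computes exactly what the base layer computes.
            return base
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# Toy rank-1 adapter on a 2x2 identity layer (hypothetical values).
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
)
x = [2.0, 4.0]

base_out = layer.forward(x)       # "generation" mode: adapter disabled
layer.adapter_enabled = True
adapted_out = layer.forward(x)    # "retrieval" mode: adapter enabled
layer.adapter_enabled = False
restored_out = layer.forward(x)   # back to the base behaviour

print(base_out, adapted_out, restored_out == base_out)
```

Keeping the low-rank update separate from the base weight, rather than merging it in, is what makes the switch lossless in this sketch: disabling the adapter skips the extra term instead of trying to subtract it back out.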

54. ❌ RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

Authors: Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

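Scoring rationale: The paper studies a language-action planner (LAD) and a rule-based planner (RAD) for autonomous driving. It is a domain-specific application and does not address large-model fundamentals, training methods, inference optimization, alignment, model compression, or the other keyword topics. The abstract's mentions of a 'language-action planner' and 'driving language models' indicate that language models are used, but no technical details are given, and none of the keywords is explicitly mentioned in the title or abstract, so every keyword is scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes RAD-LAD, a hybrid planning system that combines a rule-based planner (RAD) with a language-action planner (LAD) to deliver reliable maneuvering and adaptive decision-making in autonomous driving. On nuPlan Test14-Hard and InterPlan, LAD and RAD set the state of the art among learning-based and rule-based planners, respectively.

Abstract (Chinese translation)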

我们提出LAD，一种具有可中断架构的实时语言-动作规划器，能够以单次前向传播生成运动规划（约20 Hz），或在生成运动规划的同时输出文本推理过程（约10 Hz）。LAD的速度足以支持实时闭环部署，其延迟较现有驾驶语言模型降低约3倍，并在nuPlan Test14-Hard和InterPlan基准测试中创造了基于学习方法的最新性能纪录。我们还提出了RAD，一种基于规则的规划器，旨在解决PDM-Closed的结构性局限。RAD在nuPlan Test14-Hard和InterPlan测试中取得了基于规则的规划器中的最优性能。最后，我们证明结合RAD与LAD可实现混合规划，从而融合两种方法的优势。该混合系统表明规则与学习能力具有互补性：规则支持可靠操控，而语言模型则能实现自适应且可解释的决策过程。

Abstract

We present LAD, a real-time language–action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.

Keywords: autonomous driving, language-action planner, rule-based planner, real-time planning, hybrid planning, motion plan, nuPlan benchmark, explainable decision-making

55. ❌ Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

Authors: Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

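Scoring rationale: The paper proposes the ECGPD-LEF framework, which uses a foundation model to extract diagnostic probabilities from ECGs and combines them with interpretable modeling to detect low left ventricular ejection fraction. It is an application of AI in biomedicine (specifically cardiology). It is therefore directly related to 'Large Language Models OR LLMs OR Foundation Models' (the paper explicitly mentions a foundation model) and to 'Mechanistic Interpretability OR Explainable AI' (the framework emphasizes interpretability and includes interpretability analyses), each scored 8. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' as an AI-for-science application in bioinformatics and medicine, scored 10. The remaining keywords mainly concern large-model technical details (such as MoE, RLHF, and quantization), reasoning methods (such as CoT and MCTS), or agent systems, none of which the paper addresses, and are scored 0.

!!! tip deepseek-chat TL;DR

This study proposes ECGPD-LEF, an ECG-based predictor-driven framework that combines foundation-model-derived diagnostic probabilities with interpretable modeling to detect low left ventricular ejection fraction. It outperforms the benchmark's official end-to-end baseline in both internal and external validation and identifies key ECG predictors.

Abstract (Chinese translation)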

低左心室射血分数（LEF）常在进展至症状性心力衰竭时才被检出，这凸显了对可扩展筛查策略的需求。尽管人工智能心电图（AI-ECG）已展现出潜力，但现有方法要么完全依赖可解释性有限的端到端黑箱模型，要么依赖于性能欠佳、基于商业心电图测量算法的表格化系统。我们提出了基于心电图的预测因子驱动型LEF检测框架（ECGPD-LEF），这是一个将基础模型衍生的诊断概率与可解释建模相结合的结构化框架，用于从心电图中检测LEF。该框架在包含72,475对心电图-超声心动图数据的基准数据集EchoNext上进行训练，并在预定义的独立内部队列（n=5,442）和外部队列（n=16,017）中评估。我们的框架对中度LEF实现了稳健的鉴别能力（内部AUROC 88.4%，F1 64.5%；外部AUROC 86.8%，F1 53.6%），在人口统计学和临床亚组中均持续优于该基准提供的官方端到端基线模型。可解释性分析识别出驱动LEF风险评估的高影响力预测因子，包括正常心电图、不完全性左束支传导阻滞以及前侧壁导联的心内膜下损伤。值得注意的是，这些预测因子无需针对特定任务进行重新训练，即可独立实现类零样本推理（内部AUROC 75.3-81.0%；外部AUROC 71.6-78.6%），表明心室功能障碍内在地编码于结构化的诊断概率表征之中。该框架协调了预测性能与机制透明度，支持通过添加更多预测因子以及与现有AI-ECG系统的无缝集成来实现可扩展的性能提升。

Abstract

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

Keywords: ECG, left ventricular ejection fraction, foundation model, interpretable modeling, AI-ECG, predictor-driven framework, echocardiogram, scalable screening

56. ❌ The Unreasonable Effectiveness of Scaling Laws in AI

Authors: Chien-Ping Lu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

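Scoring rationale: The paper's core topic is AI scaling laws, particularly for pre-training, so it is highly relevant to 'Scaling Laws AND Data Quality' (10 points) and related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points). It discusses the general effectiveness of scaling laws in AI and touches on foundational large-model concepts, but its connection to specific large-model techniques (such as LLMs) is weaker, so 'Large Language Models OR LLMs OR Foundation Models' is scored 5. Other keywords (such as MoE, SFT, RAG, and quantization) are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

This paper examines why AI scaling laws, especially for pre-training, are so effective empirically. It argues that their effectiveness comes from treating compute as logical compute, an implementation-agnostic abstraction, which explains both why the laws hold across settings and why they drive continual efficiency gains in hardware, algorithms, and systems.

Abstract (Chinese translation)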

经典的AI扩展定律（特别是针对预训练阶段）描述了训练损失如何以幂律形式随计算量增加而下降。其有效性具有基础且极其实用的意义：它们使进展变得可预测，尽管速率在递减。然而，这种有效性在另外两个层面上也显得不合常理。首先，这些定律在很大程度上是经验性和观测性的，但它们反复出现在不同模型家族中，并日益扩展到与训练相关的其他机制中。其次，尽管它们预测了收益递减，但实践中的进展往往通过快速提升的效率得以持续，例如单位令牌成本的下降便可见一斑。本文认为，这两个特征源于同一根源：扩展定律之所以异常有效，是因为它们抽象掉了许多实现细节。计算变量最好被理解为逻辑计算量——一种与具体实现无关的、模型侧工作的概念，而实际扩展的负担则取决于真实资源转化为该计算量的效率。这种抽象有助于解释为何这些定律能如此广泛地适用于不同场景，以及为何它们会在硬件、算法和系统层面引发持续的效率竞赛。一旦效率被明确纳入考量，主要的实际问题就转变为：在收益递减的背景下，需要多少次效率倍增才能保持扩展的产出效益。由此视角观之，收益递减不仅是损失曲线的几何趋平，也意味着降本压力、系统级创新以及维持类摩尔定律式效率倍增所需突破的日益加剧。

Abstract

Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.

Keywords: Scaling Laws, Pre-training, Compute, Efficiency, Diminishing Returns, Logical Compute, Power-law, AI Progress
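
The "how many efficiency doublings" question in this abstract can be made concrete under a simplifying assumption. The sketch below assumes a pure power law L(C) = a·C^(−α) with no irreducible-loss term. The function names and the values of `a` and `alpha` are hypothetical and do not come from the paper.

```python
import math


def loss(compute: float, a: float, alpha: float) -> float:
    """Assumed pure power law: L(C) = a * C**(-alpha)."""
    return a * compute ** (-alpha)


def doublings_needed(loss_ratio: float, alpha: float) -> float:
    """Doublings of logical compute needed to scale the loss by `loss_ratio`.

    Solving (2**d) ** (-alpha) == loss_ratio gives
    d = -log2(loss_ratio) / alpha.
    """
    if not 0.0 < loss_ratio < 1.0:
        raise ValueError("loss_ratio must be in (0, 1)")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return -math.log2(loss_ratio) / alpha


# Hypothetical exponents: the smaller alpha is, the more doublings each
# fixed relative loss reduction costs.
for alpha in (0.5, 0.1, 0.05):
    d = doublings_needed(0.9, alpha)
    print(f"alpha={alpha}: {d:.2f} doublings for a 10% lower loss")
```

If the budget for real resources is held fixed, those doublings of logical compute have to come from efficiency gains, which is one way to read the abstract's "efficiency game". That reading is an interpretation offered here, not the paper's own model.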

57. ❌ Next-Token Prediction and Regret Minimization

Authors: Mehryar Mohri, Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

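Scoring rationale: The paper studies the use of next-token prediction models in adversarial online decision-making, focusing on how the context window affects model performance. It is related to 'Large Language Models' (8 points) because next-token prediction is the core technique behind LLMs, and highly related to 'Context Window Extension' (10 points) because it specifically analyzes bounded versus unbounded context windows and mentions the transformer architecture. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how context window size affects adversarial regret when next-token prediction models are used in adversarial online decision-making. With unbounded context windows, low regret is always achievable at negligible cost in prediction accuracy, whereas bounded context windows face fundamental limits. The paper also shows that the unbounded-context robustification can be implemented by a transformer architecture.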

Abstract

We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model’s predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a *low-regret distribution*)? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $\Theta(1)$-far from any low-regret distribution $\mathcal{D}'$ (even when $w = \Omega(T)$ and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.

Keywords: next-token prediction, adversarial online decision-making, regret minimization, context window, transformer architecture, low-regret distribution, bounded context, unbounded context
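The notion of adversarial (external) regret used in the abstract can be made concrete with a small sketch. The payoff matrix, the function names, and the toy window-based predictor below are assumptions chosen for illustration; they are not the paper's construction or its lower-bound instance.

```python
from typing import List, Sequence

def external_regret(payoff: Sequence[Sequence[float]],
                    actions: List[int], opp_actions: List[int]) -> float:
    """Payoff of the best fixed action in hindsight minus the realized payoff."""
    realized = sum(payoff[a][b] for a, b in zip(actions, opp_actions))
    best_fixed = max(sum(row[b] for b in opp_actions) for row in payoff)
    return best_fixed - realized

def windowed_best_response(payoff: Sequence[Sequence[float]],
                           opp_actions: List[int], w: int) -> List[int]:
    """Toy learner: predict the opponent's next action as the most frequent one
    among its last `w` actions, then best-respond to that prediction."""
    n_opp = len(payoff[0])
    chosen = []
    for t in range(len(opp_actions)):
        window = opp_actions[max(0, t - w):t]
        counts = [window.count(b) for b in range(n_opp)]
        predicted = counts.index(max(counts))  # ties resolve to the lowest index
        chosen.append(max(range(len(payoff)), key=lambda a: payoff[a][predicted]))
    return chosen

# Matching pennies: the learner earns 1 when it matches the opponent, else 0.
PAYOFF = [[1.0, 0.0], [0.0, 1.0]]
opponent = [1, 0] * 10  # alternating opponent, T = 20
learner = windowed_best_response(PAYOFF, opponent, w=1)
print(external_regret(PAYOFF, learner, opponent))
```

In this toy run the window-1 predictor mismatches the alternating opponent in every round, so its regret grows linearly with the horizon. The example only illustrates the definition of regret; it says nothing about the paper's $w = \Omega(T)$ result.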

58. ❌ MRI-to-CT synthesis using drifting models

Authors: Qing Lyu, Jianxu Wang, Jeremy Hudson, Ge Wang, Christopher T. Whitlow Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28498v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

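Scoring rationale: The paper focuses on medical image synthesis (MRI-to-CT) and uses deep learning models (drifting models, CNN, VAE, GAN, and diffusion models) for image generation and evaluation. It is unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, agent systems, and so on), which target natural language processing and large language models. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to biomedicine (medical imaging); since this is not its core contribution, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper studies synthesizing pelvic CT images from MRI with drifting models. Experiments show that they outperform several baselines (CNN, VAE, GAN, and diffusion models) in image quality and inference efficiency, offering a fast, high-quality synthetic CT option for applications such as MRI-only radiotherapy planning.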

Abstract

Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.

Keywords: MRI-to-CT synthesis, drifting models, pelvic CT, medical image synthesis, diffusion models, radiotherapy planning, image fidelity, inference efficiency
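Two of the fidelity metrics named in the abstract, RMSE and PSNR, have standard definitions that fit in a few lines. The sketch below treats images as flattened lists of intensities and takes the dynamic range as an explicit `data_range` argument. The sample values and the range are made up, and the paper's actual evaluation settings (intensity range, masking, normalization) are not stated in the abstract. SSIM is omitted because it needs windowed local statistics.

```python
import math
from typing import Sequence

def rmse(reference: Sequence[float], synthetic: Sequence[float]) -> float:
    """Root-mean-square error between two equally sized, flattened images."""
    if len(reference) != len(synthetic) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    return math.sqrt(mse)

def psnr(reference: Sequence[float], synthetic: Sequence[float],
         data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 20 * log10(data_range / RMSE)."""
    err = rmse(reference, synthetic)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(data_range / err)

# Made-up intensities; every pixel is off by 10 units.
ref = [0.0, 100.0, 200.0, 300.0]
syn = [10.0, 90.0, 210.0, 290.0]
print(rmse(ref, syn), psnr(ref, syn, data_range=1000.0))
```

Lower RMSE and higher PSNR both indicate a synthetic image closer to the reference, which matches the direction of the improvements reported in the abstract.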

59. ❌ Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Authors: Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28488v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

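Scoring rationale: The paper's core topic is the use of LLMs for verifying controversial claims, and it directly involves LLMs, RAG, multi-agent systems, and hallucination mitigation. LLMs, RAG, Multi-agent Systems, Hallucination Mitigation, and LLM Agents are highly relevant (10 points); Self-Correction receives 8 points because the framework includes a self-reflection mechanism; Chain of Thought and System 2 Thinking receive 5 points for the structured reasoning and deliberation involved; the remaining keywords, such as MoE, quantization, and inference acceleration, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address hallucination and shallow reasoning when LLMs verify controversial claims, the paper proposes PROClaim, a courtroom-style multi-agent debate framework that combines Progressive RAG and role-switching. It reaches 81.7% accuracy on the Check-COVID benchmark, 10 percentage points above standard multi-agent debate.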

Abstract

Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.

Keywords: Large Language Models, Retrieval-Augmented Generation, Multi-agent Debate, Claim Verification, Progressive RAG, Hallucination Mitigation, Structured Deliberation, Zero-shot Evaluation
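The abstract mentions heterogeneous multi-judge aggregation without giving the aggregation rule. A minimal sketch, assuming a plain majority vote over judge verdicts with a fallback label on ties; the label strings, the tie rule, and the function name are hypothetical and are not taken from PROClaim.

```python
from collections import Counter
from typing import List

def aggregate_verdicts(verdicts: List[str], tie_label: str = "UNDECIDED") -> str:
    """Return the majority verdict, or `tie_label` when the top two counts are equal."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts).most_common()  # sorted by count, descending
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]

# Three hypothetical judges, for example backed by different models.
print(aggregate_verdicts(["SUPPORT", "REFUTE", "SUPPORT"]))
```

An odd number of judges avoids ties on two-way decisions; a real system might instead weight judges by confidence, which this sketch does not attempt.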

60. ❌ CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

Authors: Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

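Scoring rationale: The paper proposes CiQi-Agent, a multimodal agent for the connoisseurship of antique Chinese porcelain, centered on LLM agents, tool use, retrieval-augmented generation (RAG), and AI applied to cultural heritage. Specific points of relevance: (1) the agent is built on an LLM (LLM Agents) and supports calls to a vision tool and retrieval tools (Tool Use); (2) it is trained with supervised fine-tuning (SFT) and reinforcement learning (RLHF); (3) it integrates RAG for multimodal retrieval; (4) it is an application of AI for Science in the cultural-heritage domain; (5) it involves reasoning (CoT/System 2) and explainability (Explainable AI). Other keywords such as MoE, quantization, and context extension are not covered.

!!! tip deepseek-chat TL;DR

To lower the expertise barrier in antique Chinese porcelain connoisseurship, the study proposes CiQi-Agent, a multimodal agent that integrates a vision tool, retrieval-augmented generation, supervised fine-tuning, and reinforcement learning. It outperforms competing models, including GPT-5, on six connoisseurship attributes, and the authors release the large-scale dataset CiQi-VQA.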

Abstract

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent – a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question–answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.

Keywords: multimodal agent, cultural heritage, retrieval-augmented generation, tool-augmented reasoning, supervised fine-tuning, reinforcement learning, Chinese porcelain connoisseurship, vision tool invocation

61. ❌ FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation

Authors: Tiantian Wang, Xiang Xiang, Simon S. Du Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

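Scoring rationale: The paper focuses on federated and incremental learning in healthcare, proposing a dynamic memory replay allocation method to address catastrophic forgetting under non-IID data. Its core is the federated learning framework and the incremental learning algorithm rather than an innovation in large models or deep learning fundamentals. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection, through the paper's healthcare setting (medical image datasets), so it receives 5 points. The other keywords concern large-model techniques, training methods, inference optimization, and agent systems, which are unrelated, and receive 0 points.

!!! tip deepseek-chat TL;DR

To address catastrophic forgetting caused by non-IID data in federated healthcare systems, the paper proposes a dynamic memory replay allocation strategy. Experiments on three medical image datasets show significant improvements over existing baselines.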

Abstract

In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.

Keywords: Federated Learning, Incremental Learning, Dynamic Memory Replay, Non-IID Data, Catastrophic Forgetting, Medical Image Analysis, Healthcare Systems, Exemplar Storage
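The abstract contrasts fixed per-client exemplar memory with an allocation that splits a limited budget across clients, but it does not describe the allocation rule. The sketch below is therefore not the paper's method: it assumes each client already has some non-negative priority score (how such a score would be derived is left open) and divides a fixed number of exemplar slots proportionally, using largest-remainder rounding. All names are hypothetical.

```python
from typing import Dict

def allocate_exemplar_memory(scores: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` exemplar slots across clients in proportion to `scores`,
    rounding with the largest-remainder rule so the slots sum exactly to `budget`."""
    total = sum(scores.values())
    if budget < 0 or total <= 0 or any(s < 0 for s in scores.values()):
        raise ValueError("need a non-negative budget and non-negative scores with a positive sum")
    exact = {client: budget * s / total for client, s in scores.items()}
    alloc = {client: int(x) for client, x in exact.items()}  # floor of each share
    leftover = budget - sum(alloc.values())
    # Hand the remaining slots to the clients with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for client in by_remainder[:leftover]:
        alloc[client] += 1
    return alloc

# Hypothetical priority scores for three clients sharing 12 exemplar slots.
shares = allocate_exemplar_memory(
    {"hospital_a": 3.0, "hospital_b": 1.0, "hospital_c": 1.0}, budget=12)
print(shares)
```

A fixed allocation would correspond to giving every client the same score; making the scores change between incremental tasks is what would turn this into a dynamic scheme.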

62. ❌ GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting

Authors: Xuan Deng, Xiandong Meng, Hengyu Man, Qiang Zhu, Tiange Zhang, Debin Zhao, Xiaopeng Fan Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28431v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

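Scoring rationale: The paper focuses on compression for 3D Gaussian Splatting (3DGS), which belongs to computer vision and graphics and is largely unrelated to the LLM and deep learning keywords provided. The only related keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper involves model compression (reducing storage overhead through anchor pruning and entropy coding); since it does not target large language models, it receives 5 points (some relevance). The other keywords are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the excessive storage overhead of 3D Gaussian Splatting (3DGS), the paper proposes the GeoHCC framework, which uses geometry-aware anchor pruning and hierarchical entropy coding to compress the model while maintaining superior geometric integrity and rendering fidelity.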

Abstract

Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.

Keywords: 3D Gaussian Splatting, compression, geometry-aware, anchor pruning, entropy coding, rate-distortion, rendering fidelity, storage overhead

63. ❌ AceleradorSNN: A Neuromorphic Cognitive System Integrating Spiking Neural Networks and Dynamic Image Signal Processing on FPGA

Authors: Daniel Gutierrez, Ruben Martinez, Leyre Arnedo, Antonio Cuesta, Soukaina El Hamry Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28429v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

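Scoring rationale: The paper focuses on an FPGA implementation of hardware-accelerated spiking neural networks (SNNs) and dynamic image signal processing (ISP) for real-time object detection in domains such as autonomous driving and drones. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas the paper does not touch on LLMs, deep learning model training, natural language processing, or related applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To meet the demand for high-speed, low-latency, energy-efficient object detection in autonomous systems, the paper develops AceleradorSNN, implemented on FPGA, which integrates an SNN-based neuromorphic processing unit with a dynamically reconfigurable cognitive image signal processor to address the limitations of conventional CNNs in real-time processing.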

Abstract

The demand for high-speed, low-latency, and energy-efficient object detection in autonomous systems – such as advanced driver-assistance systems (ADAS), unmanned aerial vehicles (UAVs), and Industry 4.0 robotics – has exposed the limitations of traditional Convolutional Neural Networks (CNNs). To address these challenges, we have developed AceleradorSNN, a third-generation artificial intelligence cognitive system. This architecture integrates a Neuromorphic Processing Unit (NPU) based on Spiking Neural Networks (SNNs) to process asynchronous data from Dynamic Vision Sensors (DVS), alongside a dynamically reconfigurable Cognitive Image Signal Processor (ISP) for RGB cameras. This paper details the hardware-oriented design of both IP cores, the evaluation of surrogate-gradient-trained SNN backbones, and the real-time streaming ISP architecture implemented on Field-Programmable Gate Arrays (FPGA).

Keywords: Spiking Neural Networks, Neuromorphic Processing Unit, Dynamic Vision Sensors, Cognitive Image Signal Processor, FPGA, Object Detection, Real-time Processing, Hardware Acceleration
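For readers unfamiliar with spiking neural networks, the sketch below simulates one discrete-time leaky integrate-and-fire (LIF) neuron, a common SNN building block. The abstract does not say which neuron model AceleradorSNN uses, so the LIF choice, the parameter values, and the input sequence are all illustrative assumptions.

```python
from typing import List

def lif_neuron(inputs: List[float], leak: float = 0.9,
               threshold: float = 1.0, v_reset: float = 0.0) -> List[int]:
    """Simulate one discrete-time LIF neuron and return its spike train (0 or 1 per step)."""
    v = v_reset
    spikes = []
    for current in inputs:
        v = leak * v + current   # leaky integration of the input current
        if v >= threshold:       # fire when the membrane potential reaches the threshold
            spikes.append(1)
            v = v_reset          # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

# Made-up input currents: three sub-threshold inputs accumulate into one spike.
print(lif_neuron([0.5, 0.5, 0.5, 0.0, 1.2]))
```

The output is sparse and event-driven: the neuron stays silent until enough input has accumulated, which is the property that makes SNNs a natural fit for asynchronous DVS events.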

64. ❌ Learning unified control of internal spin squeezing in atomic qudits for magnetometry

Authors: C. Z. Cao, J. Z. Han, M. Xiong, M. Deng, L. Wang, X. Lv, M. Xue Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28421v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

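Scoring rationale: The paper studies spin-squeezing control in quantum magnetometry, using reinforcement learning to optimize quantum sensor performance. The keywords all concern the principles or applications of large models and deep learning, whereas the paper only applies reinforcement learning to a quantum physics experiment and involves neither large models nor the bioinformatics or cheminformatics side of AI for Science. Only 'AI for Science' receives 5 points (weak relevance) as a scientific application in the broad sense; the rest are entirely unrelated.

!!! tip deepseek-chat TL;DR

The paper studies how physics-informed reinforcement learning can control the nonlinear Zeeman effect in multilevel atoms to rapidly prepare and stabilize spin-squeezed states in a quantum magnetometer, turning the intrinsic nonlinear dynamics into a sustained metrological advantage and achieving magnetic sensitivity beyond the standard quantum limit.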

摘要 (Abstract)

Generating and preserving metrologically useful quantum states is a central challenge in quantum-enhanced atomic magnetometry. In multilevel atoms operated in the low-field regime, the nonlinear Zeeman (NLZ) effect is both a resource and a limitation. It nonlinearly redistributes internal spin fluctuations to generate spin-squeezed states within a single atomic qudit, yet under fixed readout it distorts the measurement-relevant quadrature and limits the accessible metrological gain. This challenge is compounded by the time dependence of both the squeezing axis and the effective nonlinear action. Here we show that physics-informed reinforcement learning can transform NLZ dynamics from a source of readout degradation into a sustained metrological resource. Using only experimentally accessible low-order spin moments, a trained agent identifies, in the $f=21/2$ manifold of $^{161}\mathrm{Dy}$, a unified control policy that rapidly prepares strongly squeezed internal states and stabilizes more than $4\,\mathrm{dB}$ of fixed-axis spin squeezing under always-on NLZ evolution. Including state-preparation overhead, the learned protocol yields a single-atom magnetic sensitivity of $13.9\,\mathrm{pT}/\sqrt{\mathrm{Hz}}$, corresponding to an advantage of approximately $3\,\mathrm{dB}$ beyond the standard quantum limit. Our results establish learning-based control as a practical route for converting unavoidable intrinsic nonlinear dynamics in multilevel quantum sensors into operational metrological advantage.

Keywords: spin squeezing, quantum magnetometry, reinforcement learning, nonlinear Zeeman effect, atomic qudits, metrological gain, control policy, Dy-161
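
The decibel figures quoted in the abstract (more than 4 dB of squeezing, roughly 3 dB beyond the standard quantum limit) can be turned into linear ratios for intuition. The sketch below assumes the common variance-style convention, ratio = 10^(dB/10); the abstract does not state which convention the authors use, and the function names are illustrative.

```python
import math


def db_to_variance_ratio(db: float) -> float:
    """Convert a dB figure to a linear ratio, ASSUMING the 10*log10 (variance) convention."""
    return 10.0 ** (db / 10.0)


def variance_ratio_to_db(ratio: float) -> float:
    """Inverse conversion under the same assumed convention."""
    return 10.0 * math.log10(ratio)


# Under this assumption, 4 dB is a factor of about 2.5 and 3 dB a factor of about 2.
squeezing_ratio = db_to_variance_ratio(4.0)
advantage_ratio = db_to_variance_ratio(3.0)
```

If the paper instead reports an amplitude-style quantity (20*log10), the same dB values would correspond to smaller linear factors, so the numbers above are only a reading aid.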

65. ❌ Spectral Higher-Order Neural Networks

Authors: Gianluca Peri, Timoteo Carletti, Duccio Fanelli, Diego Febbe Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28420v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

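Scoring rationale: The paper "Spectral Higher-Order Neural Networks" proposes a new neural network architecture, SHONNs, that introduces higher-order interactions into general-purpose feedforward networks and uses spectral attributes to mitigate stability and parameter-scaling problems. The work is foundational theory for neural network architectures; it does not involve large models (LLMs), the specific deep-learning techniques targeted by the keywords (e.g. MoE, scaling laws, training methods), or applications of large models to science or other domains. All keywords concern large models and their techniques or applications, whereas this paper studies higher-order interaction theory for conventional neural networks, so it is unrelated to every keyword.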

!!! tip deepseek-chat TL;DR

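The paper proposes Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy that uses spectral attributes to introduce higher-order interactions into general-purpose feedforward networks, addressing the stability and parameter-scaling problems of weighted higher-order forward propagation.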

Abstract

Neural networks are fundamental tools of modern machine learning. The standard paradigm assumes binary interactions (across feedforward linear passes) between inter-tangled units, organized in sequential layers. Generalized architectures have been also designed that move beyond pairwise interactions, so as to account for higher-order couplings among computing neurons. Higher-order networks are however usually deployed as augmented graph neural networks (GNNs), and, as such, prove solely advantageous in contexts where the input exhibits an explicit hypergraph structure. Here, we present Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy to incorporate higher-order interactions in general-purpose, feedforward, network structures. SHONNs leverages a reformulation of the model in terms of spectral attributes. This allows to mitigate the common stability and parameter scaling problems that come along weighted, higher-order, forward propagations.

Keywords: Spectral Higher-Order Neural Networks, higher-order interactions, feedforward networks, spectral attributes, stability, parameter scaling, neural network architecture
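
To make the parameter-scaling problem mentioned in the abstract concrete, the sketch below compares a standard pairwise layer with a naive dense second-order layer. This is a generic illustration and not the SHONN spectral formulation, which the abstract does not specify; the tensor `T` and all function names are assumptions made for the example.

```python
import numpy as np


def pairwise_layer(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Standard linear pass: y[o] = sum_j W[o, j] * x[j]."""
    return W @ x


def second_order_layer(W: np.ndarray, T: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Adds a dense second-order term: y[o] += sum_{j,k} T[o, j, k] * x[j] * x[k]."""
    return W @ x + np.einsum("ojk,j,k->o", T, x, x)


def parameter_counts(n_in: int, n_out: int) -> tuple:
    """(pairwise weights, pairwise plus dense second-order weights)."""
    pairwise = n_in * n_out
    return pairwise, pairwise + n_out * n_in * n_in


rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_out, n_in))
T = rng.normal(size=(n_out, n_in, n_in))  # hypothetical second-order weights
x = rng.normal(size=n_in)
y = second_order_layer(W, T, x)

# For a 256 -> 256 layer the dense second-order term adds 256**3 weights.
pairwise_params, second_order_params = parameter_counts(256, 256)
```

The cubic growth of the dense second-order tensor is the kind of cost a reformulation would need to avoid.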

66. ❌ KGroups: A Versatile Univariate Max-Relevance Min-Redundancy Feature Selection Algorithm for High-dimensional Biological Data

Authors: Malick Ebiele, Malika Bendechache, Rob Brennan Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28417v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

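Scoring rationale: The paper focuses on a feature selection algorithm (KGroups) for high-dimensional biological data, which falls within bioinformatics. It is moderately related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because it applies AI to biological data analysis. It does not cover large models, deep-learning principles, LLM-related methods (e.g. MoE, SFT, RAG), reasoning techniques (e.g. CoT), model optimization (e.g. quantization), or agent systems, so all other keywords are entirely unrelated (0 points).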

!!! tip deepseek-chat TL;DR

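The paper proposes KGroups, a new univariate max-relevance min-redundancy feature selection algorithm for high-dimensional biological data; experiments show it is up to 821 times faster than multivariate mRMR while achieving similar predictive performance.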

Abstract

This paper proposes a new univariate filter feature selection (FFS) algorithm called KGroups. The majority of work in the literature focuses on investigating the relevance or redundancy estimations of feature selection (FS) methods. This has shown promising results and a real improvement of FFS methods’ predictive performance. However, limited efforts have been made to investigate alternative FFS algorithms. This raises the following question: how much of the FFS methods’ predictive performance depends on the selection algorithm rather than the relevance or the redundancy estimations? The majority of FFS methods fall into two categories: relevance maximisation (Max-Rel, also known as KBest) or simultaneous relevance maximisation and redundancy minimisation (mRMR). KBest is a univariate FFS algorithm that employs sorting (descending) for selection. mRMR is a multivariate FFS algorithm that employs an incremental search algorithm for selection. In this paper, we propose a new univariate mRMR called KGroups that employs clustering for selection. Extensive experiments on 14 high-dimensional biological benchmark datasets showed that KGroups achieves similar predictive performance compared to multivariate mRMR while being up to 821 times faster. KGroups is parameterisable, which leaves room for further predictive performance improvement through hyperparameter finetuning, unlike mRMR and KBest. KGroups outperforms KBest.

Keywords: feature selection, KGroups, univariate filter, mRMR, high-dimensional biological data, clustering algorithm, predictive performance, computational efficiency
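
The two baseline families named in the abstract can be sketched as follows: KBest ranks features by relevance and keeps the top k, while mRMR adds features one at a time, trading relevance against redundancy with the features already chosen. This is a minimal sketch that assumes absolute Pearson correlation for both relevance and redundancy and a difference-style mRMR criterion; it does not implement KGroups itself, whose clustering procedure the abstract does not detail.

```python
import numpy as np


def abs_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute Pearson correlation between two 1-D arrays."""
    return float(abs(np.corrcoef(a, b)[0, 1]))


def kbest(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Univariate selection: sort features by relevance (descending), keep the top k."""
    relevance = np.array([abs_corr(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-relevance, kind="stable")
    return [int(j) for j in order[:k]]


def greedy_mrmr(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Multivariate selection: incrementally add the feature with the best
    relevance minus mean redundancy against the already-selected features."""
    n_features = X.shape[1]
    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = -1, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs_corr(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Synthetic data (made up for illustration): columns 0 and 1 are near-duplicates,
# column 2 is a weaker but independent predictor.
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])
y = a + 0.5 * b

kbest_choice = kbest(X, y, k=2)        # keeps both near-duplicate columns
mrmr_choice = greedy_mrmr(X, y, k=2)   # swaps one duplicate for column 2
```

The nested loop in `greedy_mrmr` recomputes pairwise redundancies for every candidate at every step, which gives a sense of why an incremental multivariate search is much slower than a single univariate pass.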

67. ❌ Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

Authors: Alkis Sygkounas, Amy Loutfi, Andreas Persson Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28416v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

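Scoring rationale: The core of the paper is using large language models (LLMs) as generative variation operators to evolve and discover reinforcement learning algorithms, so it is highly related to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The paper applies AI to a scientific task (designing reinforcement learning algorithms), giving it a moderate association with 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). Other keywords such as MoE, SLMs, scaling laws, the various training techniques, inference optimization, agent systems, and model compression are not covered by the paper and receive 0 points.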

!!! tip deepseek-chat TL;DR

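The paper proposes an evolutionary framework that uses large language models as generative variation operators to automatically discover reinforcement learning algorithms, achieving performance competitive with established algorithms such as SAC and PPO on multiple benchmarks.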

Abstract

Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor–critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.

Keywords: Evolutionary Discovery, Reinforcement Learning Algorithms, Large Language Models, Generative Variation Operators, Update Rules, Hyperparameter Refinement, Gymnasium Benchmarks, Competitive Performance
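
The general shape of an evolutionary search with a pluggable variation operator can be sketched as below. This is a toy illustration, not REvolve or the paper's system: candidates here are plain floats rather than executable update rules, `toy_vary` is a random perturbation standing in for the LLM variation operator, and `toy_fitness` stands in for a full training run. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Stand-in for an executable update rule; the paper evolves code, this sketch evolves a float.
Candidate = float


def evolve(
    population: List[Candidate],
    fitness: Callable[[Candidate], float],
    vary: Callable[[Candidate, random.Random], Candidate],
    generations: int,
    n_survivors: int,
    seed: int = 0,
) -> Tuple[Candidate, float]:
    """Keep the fittest candidates each generation and refill with varied copies."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:n_survivors]  # elitism: the best candidates always survive
        children = [
            vary(rng.choice(parents), rng)
            for _ in range(len(population) - n_survivors)
        ]
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(x: Candidate) -> float:
    """Placeholder for an end-to-end training evaluation; peaks at x = 3."""
    return -(x - 3.0) ** 2


def toy_vary(parent: Candidate, rng: random.Random) -> Candidate:
    """Placeholder for the LLM variation operator: a small random perturbation."""
    return parent + rng.gauss(0.0, 0.5)


initial = [0.0, 1.0, -2.0, 5.0]
best, best_fitness = evolve(initial, toy_fitness, toy_vary, generations=30, n_survivors=2)
```

Because the parents are carried over unchanged, the best fitness never decreases from one generation to the next; swapping `toy_vary` for a model-backed operator would leave the loop itself unchanged.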

68. ❌ MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28407v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

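Scoring rationale: The paper focuses on an evaluation benchmark for deep research agents, which belongs to large-model applications. Core related keywords include LLM Agents (highly related, 10 points), covering agentic workflows, tool use, retrieval-augmented generation, reasoning processes (CoT, System 2), factuality verification, and self-correction. Other keywords, such as foundation models and multimodal tasks, are also related (8 points). The paper does not involve model architecture innovations (e.g. MoE, quantization), training methods (pre-training, fine-tuning, alignment), or AI-for-science applications, so these receive 0 points.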

!!! tip deepseek-chat TL;DR

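To address the shortcomings of existing evaluations of deep research systems, the paper proposes MiroEval, a multimodal benchmark and evaluation framework. Evaluating 13 systems, it finds that process quality reliably predicts overall outcome and that multimodal tasks pose greater challenges, with the MiroThinker series showing the most balanced performance.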

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.

Keywords: deep research agents, evaluation benchmark, multimodal tasks, agentic workflow, process-centric evaluation, factuality verification, adaptive synthesis, MiroEval

69. ❌ EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation

Authors: Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram N R Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

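Scoring rationale: The paper focuses on efficient deployment of Diffusion Transformers (DiT) on edge devices, an application of large models to image generation with technical novelty. Relevance by keyword: 1) highly related to "Small Language Models OR SLMs OR On-device AI" (10 points), because the paper's core is optimizing models for on-device AI; 2) highly related to "Quantization OR Model Compression OR Low-bit Weights" (10 points), since it compresses the model through parameter pruning and structural optimization; 3) highly related to "Speculative Decoding OR Inference Acceleration" (10 points), since it significantly reduces latency and improves inference efficiency; 4) moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points), since DiT is a kind of foundation model but the paper does not directly involve language models. All other keywords are unrelated to the paper (0 points).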

!!! tip deepseek-chat TL;DR

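To address the high computational complexity and memory requirements that Diffusion Transformers face when deployed on resource-constrained edge devices, the paper proposes the hardware-aware EdgeDiT model family. Its optimization framework achieves 20-30% fewer parameters, 36-46% fewer FLOPs, and a 1.65-fold reduction in latency while preserving generation quality.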

Abstract

Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.

Keywords: Diffusion Transformers, On-device AI, Model Compression, Inference Acceleration, Edge Devices, Hardware-aware Optimization, Generative AI, Mobile NPUs
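
The abstract's claim of a better Pareto trade-off between FID and latency can be read as follows: a model is on the Pareto front if no other model is at least as good on both metrics and strictly better on one. The sketch below computes such a front and also converts the quoted 1.65-fold latency reduction into a percentage. The model names and numbers are invented placeholders, not results from the paper.

```python
from typing import List, Tuple

# (model name, latency in ms, FID); lower is better for both metrics.
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, sorted by latency."""
    front = []
    for name, lat, fid in points:
        dominated = any(
            (o_lat <= lat and o_fid <= fid) and (o_lat < lat or o_fid < fid)
            for _, o_lat, o_fid in points
        )
        if not dominated:
            front.append((name, lat, fid))
    return sorted(front, key=lambda p: p[1])


def fold_to_fraction_saved(fold: float) -> float:
    """A k-fold reduction leaves 1/k of the original, i.e. saves 1 - 1/k."""
    return 1.0 - 1.0 / fold


# Hypothetical measurements, for illustration only.
candidates = [
    ("model_a", 100.0, 12.0),
    ("model_b", 60.0, 13.0),
    ("model_c", 80.0, 15.0),  # slower and worse than model_b, so dominated
    ("model_d", 40.0, 20.0),
]
front = pareto_front(candidates)
latency_saved = fold_to_fraction_saved(1.65)  # roughly 0.39, i.e. about 39% lower latency
```

The quadratic scan is fine for a handful of models; a sort-based sweep would be preferable for large candidate sets.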

70. ❌ From Simulation to Deep Learning: Survey on Network Performance Modeling Approaches

Authors: Carlos Güemes-Palau, Miquel Ferriol-Galmés, Jordi Paillisse-Vilanova, Pere Barlet-Ros, Albert Cabellos-Aparicio Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28394v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

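Scoring rationale: The paper is a survey of network performance modeling approaches, mainly discussing traditional simulation methods (such as discrete event simulation) and machine learning models for network performance prediction. Although it mentions machine learning models, every keyword specifically targets large language models (LLMs) and related techniques (e.g. MoE, RLHF, RAG). The paper does not involve large language models or deep learning applied to science at all; it only covers general machine learning methods for network performance prediction, so every keyword receives a relevance of 0.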

!!! tip deepseek-chat TL;DR

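The paper surveys the historical evolution of performance modeling approaches for wired networks, from traditional discrete event simulation and analytical methods to the rise of machine learning models, and proposes a taxonomy summarizing the field's technical development and research trends.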

Abstract

Network performance modeling is a field that predates early computer networks and the beginning of the Internet. It aims to predict the traffic performance of packet flows in a given network. Its applications range from network planning and troubleshooting to feeding information to network controllers for configuration optimization. Traditional network performance modeling has relied heavily on Discrete Event Simulation (DES) and analytical methods grounded in mathematical theories such as Queuing Theory and Network Calculus. However, as of late, we have observed a paradigm shift, with attempts to obtain efficient Parallel DES, the surge of Machine Learning models, and their integration with other methodologies in hybrid approaches. This has resulted in a great variety of modeling approaches, each with its strengths and often tailored to specific scenarios or requirements. In this paper, we comprehensively survey the relevant network performance modeling approaches for wired networks over the last decades. With this understanding, we also define a taxonomy of approaches, summarizing our understanding of the state-of-the-art and how both technology and the concerns of the research community evolve over time. Finally, we also consider how these models are evaluated, how their different nature results in different evaluation requirements and goals, and how this may complicate their comparison.

Keywords: network performance modeling, discrete event simulation, machine learning models, wired networks, survey, taxonomy, hybrid approaches, evaluation methods

71. ❌ The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Authors: Doan Nam Long Vu, Simone Balloccu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28387v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

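Scoring rationale: The paper studies the evaluation of clinical vision-language models (VLMs) for neuroimaging diagnosis and finds that prompt framing, rather than genuine multimodal reasoning, drives the performance gains. This is a core issue for hallucination mitigation and explainable AI, applied in the bioinformatics / AI-for-science domain, so it is highly related to 'Hallucination Mitigation OR Factuality OR Truthfulness', 'Mechanistic Interpretability OR Explainable AI', and 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points each). The paper involves VLMs (a kind of large/foundation model) and small or distilled models, giving a moderate association with 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' (5 points each). It mentions preference alignment, giving a moderate association with 'Instruction Tuning OR Alignment OR Value Alignment' (5 points). Other keywords such as MoE, scaling laws, training techniques, inference acceleration, and agent systems are not covered and receive 0 points.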

!!! tip deepseek-chat TL;DR

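The paper finds that, in clinical neuroimaging diagnosis evaluation, the performance gains of vision-language models (VLMs) stem mainly from the prompt mentioning MRI availability (the 'scaffold effect') rather than from genuine multimodal reasoning, exposing the inadequacy of surface-level evaluation and underscoring the importance of hallucination mitigation and explainability.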

Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

关键词: vision-language models, clinical AI, neuroimaging, scaffold effect, hallucination, multimodal reasoning, prompt framing, evaluation

72. ❌ COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game

Authors: Alkis Sygkounas, Rishi Hazra, Andreas Persson, Pedro Zuidberg Dos Martires, Amy Loutfi Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

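Scoring rationale: The paper's core contribution is using LLMs to generate environments and policies and achieving continual learning through adversarial co-evolution. It is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each), since LLMs are the central tool and the work studies the continual improvement of autonomous agents. It has some relevance to 'Self-Correction' (5 points), because co-evolution involves policy adaptation and improvement, though not a direct self-correction mechanism. Other keywords, such as MoE, SFT, and RAG, are not covered and receive 0 points.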

!!! tip deepseek-chat TL;DR

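The paper proposes the COvolve framework, which uses large language models to automatically generate environments and policies through adversarial co-evolution. It addresses the way static training environments limit continual learning and generalization, and it reports environments and policies growing in complexity together on tasks such as driving and maze-solving.

Abstract (Chinese translation)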

构建持续改进智能体的一个核心挑战在于：训练环境通常是静态的或人工构建的。这限制了持续学习以及向训练分布之外的泛化能力。我们通过COvolve框架来解决这一问题，该框架利用大语言模型（LLMs）来生成环境与智能体策略，两者均以可执行的Python代码形式表达。我们将环境设计者与策略设计者之间的互动建模为一个双人零和博弈，确保对抗性协同进化——环境不断暴露策略的弱点，而策略则相应地进行适应。这一过程催生了一个自动化课程，其中环境与策略共同向更高复杂度进化。为了确保鲁棒性并防止在课程推进过程中出现遗忘，我们计算了该零和博弈的混合策略纳什均衡（MSNE），从而得到一个元策略。该MSNE元策略确保了智能体在学习解决前所未见的环境时，不会遗忘如何解决已见过的环境。在城市驾驶、符号迷宫求解和几何导航任务中的实验表明，COvolve能够生成逐渐复杂化的环境。我们的结果证明了LLM驱动的协同进化在实现开放式学习方面的潜力，而无需预定义的任务分布或人工干预。

Abstract

A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments. Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.

Keywords: co-evolution, large language models, LLM-generated policies, adversarial training, zero-sum game, continual learning, automated curriculum, Nash equilibrium
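
The abstract above states that COvolve computes the mixed-strategy Nash equilibrium (MSNE) of a zero-sum game to obtain a meta-policy, but it does not say how the equilibrium is computed. As a rough illustration only, the sketch below approximates an MSNE with fictitious play, a classical technique swapped in here for demonstration. The function name `fictitious_play` and the payoff matrices are hypothetical; rows can be read as candidate policies and columns as candidate environments.

```python
def fictitious_play(payoff, iterations=20000):
    """Approximate the MSNE of a zero-sum matrix game by fictitious play.

    payoff[i][j] is the row player's gain (and the column player's loss).
    Returns the empirical strategy frequencies of both players.
    This is an illustrative stand-in, not the solver used in the paper.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Cumulative payoff each pure strategy would have earned so far.
    row_gain = [0.0] * n_rows
    col_loss = [0.0] * n_cols
    i, j = 0, 0  # arbitrary initial pure strategies
    for _ in range(iterations):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n_rows):
            row_gain[r] += payoff[r][j]
        for c in range(n_cols):
            col_loss[c] += payoff[i][c]
        # Each player best-responds to the opponent's empirical history.
        i = max(range(n_rows), key=lambda r: row_gain[r])
        j = min(range(n_cols), key=lambda c: col_loss[c])
    row_mix = [n / iterations for n in row_counts]
    col_mix = [n / iterations for n in col_counts]
    return row_mix, col_mix


# Hypothetical 2x2 game (matching pennies); the equilibrium mixes both rows.
policy_mix, env_mix = fictitious_play([[1, -1], [-1, 1]])
print(policy_mix, env_mix)
```

Under this reading, the row mixture plays the role of the meta-policy: a distribution over previously generated policies that no single environment in the pool can exploit heavily.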

73. ❌ Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids

Authors: Carlos S. Sepúlveda, Gonzalo A. Ruz Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

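Scoring rationale: The paper focuses on solving maritime coverage path planning (CPP) with deep reinforcement learning (DRL) and a Transformer architecture, an application of AI to science and engineering. The keywords all relate directly to large-model (LLM) technology, training methods, inference optimization, alignment, agent systems, and so on, whereas this paper involves no large-model technology and uses only standard DRL with a Transformer as the policy network. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because maritime surveillance (for example search and rescue and environmental monitoring) can be viewed as AI applied to science and engineering, although it is not core bioinformatics or cheminformatics, hence 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.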

!!! tip deepseek-chat TL;DR

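The paper proposes a critic-free method based on deep reinforcement learning and a Transformer pointer policy for maritime coverage path planning on irregular hexagonal grids. On unseen synthetic environments it achieves a 99% Hamiltonian success rate, with paths 7% shorter and 24% fewer heading changes than the closest baseline, and it runs in real time on a laptop GPU.

Abstract (Chinese translation)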

海上监视任务（如搜救与环境监测）依赖于在广阔且几何结构复杂的区域内高效部署传感资产。传统的覆盖路径规划方法依赖于分解技术，这些技术难以处理不规则的海岸线、岛屿和禁航区，或需要对每个场景进行计算成本高昂的重新规划。我们提出了一种深度强化学习框架，用于在不规则海域的六边形网格表示上解决覆盖路径规划问题。与传统方法不同，我们将该问题构建为神经组合优化任务，其中基于Transformer的指针策略通过自回归方式构建覆盖路径。为克服长视野路径规划中价值估计不稳定的问题，我们实施了无评论者的组相对策略优化方案。该方法通过对采样轨迹进行实例内比较来估计优势，而非依赖价值函数。在1000个未见过的合成海洋环境上的实验表明，经过训练的策略实现了99.0%的哈密顿成功率，超过最佳启发式方法（46.0%）的两倍以上，同时生成的路径比最接近的基线缩短7%，航向变更减少24%。所有三种推理模式（贪婪策略、随机采样及结合2-opt优化的采样）在笔记本电脑GPU上均能在每实例50毫秒内完成运算，证实了其实时机载部署的可行性。

Abstract

Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task where a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.

Keywords: Coverage Path Planning, Deep Reinforcement Learning, Transformer, Maritime Surveillance, Hexagonal Grids, Critic-Free, Group-Relative Policy Optimization, Neural Combinatorial Optimization
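
The abstract above describes estimating advantages "through within-instance comparisons of sampled trajectories rather than relying on a value function". A minimal sketch of that idea follows, assuming the commonly used GRPO normalization (subtract the group mean and divide by the group standard deviation); the paper's exact normalization and reward definition are not given in the abstract, so the function name, the `eps` parameter, and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for trajectories sampled on the same problem instance.

    Each trajectory is scored relative to its own group, so no learned
    critic is needed. Mean/std normalization is an assumption here.
    """
    baseline = mean(rewards)   # the group mean acts as the baseline
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards (e.g. negative tour length) for four sampled tours
# on one map; better-than-average tours get positive advantages.
print(group_relative_advantages([-12.0, -10.0, -15.0, -11.0]))
```

When every sampled trajectory earns the same reward, all advantages are zero and the instance contributes no gradient signal, which is one reason sampling diversity matters for this family of methods.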

74. ❌ Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Authors: Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

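Scoring rationale: The paper focuses on the design and optimization of deep research agents; its core innovation is a verification-centric framework. Highly relevant keywords are: LLM Agents (the paper's central topic), Tool Use (the agent uses tools for information retrieval), Chain of Thought and System 2 Thinking (multi-step and in-depth reasoning), Self-Correction (self-improvement through verification), Hallucination Mitigation (verification aims to reduce errors), and Retrieval-Augmented Generation (the agent retrieves information and generates from it). The paper is related to large language models (LLMs) but does not explore other techniques such as MoE, quantization, or training methods, so those keywords score 0.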

!!! tip deepseek-chat TL;DR

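The paper proposes Marco DeepResearch, a deep research agent designed around a verification-centric framework. By introducing verification at three levels (data synthesis, trajectory construction, and test-time scaling), it significantly improves performance on complex open-domain research tasks and outperforms comparable models on several benchmarks.

Abstract (Chinese translation)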

深度研究智能体能够自主开展开放式研究，通过整合复杂信息检索与跨多元来源的多步推理来解决现实世界问题。为在长周期任务中维持这种能力，可靠的验证机制在训练与推理阶段都至关重要。现有范式的主要瓶颈源于在问答数据合成、轨迹构建及测试时扩展中缺乏显式验证机制。各阶段引入的错误会向下游传播，从而降低智能体的整体性能。为此，我们提出Marco DeepResearch——一个以验证为核心框架设计优化的深度研究智能体，该框架包含三个层面：（1）问答数据合成：我们在基于图与基于智能体的问答合成中引入验证机制，以控制问题难度，同时确保答案的唯一性与正确性；（2）轨迹构建：我们设计了一种验证驱动的轨迹合成方法，将显式验证模式注入训练轨迹；（3）测试时扩展：我们在推理阶段使用Marco DeepResearch自身作为验证器，有效提升了在挑战性问题上的性能。大量实验结果表明，我们所提出的Marco DeepResearch智能体在多数高难度基准测试（如BrowseComp和BrowseComp-ZH）上显著优于8B规模的深度研究智能体。关键的是，在最多600次工具调用的预算限制下，Marco DeepResearch甚至超越或接近了若干30B规模的智能体，例如Tongyi DeepResearch-30B。

Abstract

Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.

Keywords: deep research agents, verification-centric framework, multi-step reasoning, information retrieval, autonomous investigation, tool calls, QA data synthesis, trajectory construction

75. ❌ Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science

Authors: Yipeng Yu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28361v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

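Scoring rationale: The paper focuses squarely on the application of large language models (LLMs) to science (AI for Science) and on the development of LLM agents, so these three keywords receive high scores (10 points). The paper notes that LLMs have evolved from text interaction to tool use, so 'Tool Use' receives a medium score (5 points). Specific techniques covered by other keywords, such as MoE, quantization, inference acceleration, and alignment, are not mentioned in the abstract and score 0.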

!!! tip deepseek-chat TL;DR

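By defining 'deep research' and unifying the perspectives of industry's deep research and academia's AI for Science, this paper discusses the roadmap of large language models from the Transformer to agents, along with their applications, challenges, and future directions in cross-disciplinary scientific innovation.

Abstract (Chinese translation)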

随着大语言模型（LLM）知识库与推理能力的进步，其交互模态已从纯文本演进至多模态，并进一步发展为智能体工具调用。相应地，其应用范围也从问答系统扩展到AI助手，如今正迈向通用智能体领域。深度研究（DR）是通用智能体的一个典型垂直应用，它代表了智能信息处理以及辅助人类发现问题、解决问题的理想路径，其目标是达到甚至超越顶尖人类科学家的水平。本文对深度研究本身进行了一次深度探究。我们阐述了深度研究的清晰且精确的定义，并在一个发展框架内统一了产业界的深度研究与学术界的“人工智能驱动科学”（AI4S）的视角。我们将大语言模型与稳定扩散模型定位为生成式人工智能的两大支柱，并勾勒出从Transformer到智能体的演进路线图。我们审视了AI4S在不同学科领域的进展，归纳了当前主流的人机交互范式与系统架构，并讨论了存在的主要挑战与基础研究问题。人工智能支持科学创新，科学亦能促进人工智能发展（科学促进人工智能，S4AI）。我们希望本文有助于弥合人工智能社区与AI4S社区之间的隔阂。

Abstract

With the advancement of large language models (LLMs) in their knowledge base and reasoning capabilities, their interactive modalities have evolved from pure text to multimodality and further to agentic tool use. Consequently, their applications have broadened from question answering to AI assistants and now to general-purpose agents. Deep research (DR) represents a prototypical vertical application for general-purpose agents, which represents an ideal approach for intelligent information processing and assisting humans in discovering and solving problems, with the goal of reaching or even surpassing the level of top human scientists. This paper provides a deep research of deep research. We articulate a clear and precise definition of deep research and unify perspectives from industry’s deep research and academia’s AI for Science (AI4S) within a developmental framework. We position LLMs and Stable Diffusion as the twin pillars of generative AI, and lay out a roadmap evolving from the Transformer to agents. We examine the progress of AI4S across various disciplines. We identify the predominant paradigms of human-AI interaction and prevailing system architectures, and discuss the major challenges and fundamental research issues that remain. AI supports scientific innovation, and science also can contribute to AI growth (Science for AI, S4AI). We hope this paper can help bridge the gap between the AI and AI4S communities.

Keywords: Large Language Models, AI for Science, Agents, Deep Research, Transformer, Generative AI, Scientific Innovation, Human-AI Interaction

76. ❌ CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems

Authors: Kangkang Sun, Jun Wu, Jianhua Li, Minyi Guo, Xiuzhen Che, Jianwei Huang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28360v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

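Scoring rationale: The paper's core topic is uncertainty quantification in multi-LLM systems, which bears directly on 'Large Language Models OR LLMs OR Foundation Models' (it uses models such as LLaMA-3.1-8B-Instruct), 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (it studies collaborative multi-LLM systems), and 'Multi-agent Systems OR Agent Coordination' (it concerns coordination and disagreement between models). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and scientific AI applications, are not mentioned in the abstract and are unrelated to the paper.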

!!! tip deepseek-chat TL;DR

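To address the fact that existing uncertainty estimation methods for multi-LLM systems do not adequately capture semantic disagreement across models, the paper proposes Collaborative Entropy (CoE), a unified information-theoretic metric. Experiments show that CoE provides stronger uncertainty estimation than entropy- and divergence-based baselines.

Abstract (Chinese translation)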

多LLM系统中的不确定性估计在很大程度上仍以单一模型为中心：现有方法量化每个模型内部的不确定性，但未能充分捕捉模型间的语义分歧。为弥补这一不足，我们提出协同熵（Collaborative Entropy，简称CoE），即一种用于多LLM协作中语义不确定性的统一信息论度量。CoE定义在共享语义聚类空间上，融合了两个组成部分：模型内语义熵与模型间相对于集成均值的分歧度。CoE并非加权集成预测器，而是表征协作置信度与分歧的系统级不确定性度量。我们分析了CoE的若干核心性质，包括非负性、完美语义共识下的零值确定性，以及当单个模型坍缩为狄拉克分布时CoE的行为特征。这些结果阐明了何时降低单模型不确定性即已足够，而何时仍会存在残余的模型间分歧。我们还提出一种简单的、基于CoE指导的无训练事后协调启发式方法，作为该度量的实际应用。在TriviaQA和SQuAD数据集上使用LLaMA-3.1-8B-Instruct、Qwen-2.5-7B-Instruct和Mistral-7B-Instruct进行的实验表明，相较于基于标准熵与分歧度的基线方法，CoE能提供更强健的不确定性估计，且随着引入更多异构模型，其优势进一步扩大。总体而言，CoE为多LLM协作提供了一个有价值的、具备不确定性感知能力的研究视角。

Abstract

Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.

Keywords: Uncertainty Quantification, Multi-LLM Systems, Collaborative Entropy, Semantic Uncertainty, Agentic Collaboration, Model Disagreement, Information-theoretic Metric, Post-hoc Coordination
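
The abstract above says CoE combines intra-model semantic entropy with inter-model divergence to the ensemble mean, over a shared semantic cluster space. The sketch below illustrates that two-part structure under stated assumptions: the divergence is taken to be KL, both terms are averaged over models with equal weight, and each model is represented by a probability vector over the same clusters. The abstract does not give the exact formula, so `collaborative_entropy_sketch` is a hypothetical stand-in rather than the paper's definition.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a distribution over semantic clusters."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def collaborative_entropy_sketch(model_dists):
    """Mean intra-model entropy plus mean KL divergence to the ensemble mean.

    Equal weighting and the choice of KL are assumptions for illustration.
    """
    n_models = len(model_dists)
    n_clusters = len(model_dists[0])
    ensemble = [sum(d[c] for d in model_dists) / n_models
                for c in range(n_clusters)]
    intra = sum(entropy(d) for d in model_dists) / n_models
    inter = sum(kl_divergence(d, ensemble) for d in model_dists) / n_models
    return intra + inter


# Two models, each fully certain, but committed to different clusters:
# the intra-model term is zero while inter-model disagreement remains.
print(collaborative_entropy_sketch([[1.0, 0.0], [0.0, 1.0]]))
```

The example mirrors the delta-distribution case mentioned in the abstract: driving each model's own uncertainty to zero does not remove the disagreement term unless the models also agree on the cluster.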

77. ❌ Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code

Authors: Zihao Xu, Xiao Cheng, Ruijie Meng, Yuekang Li Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28345v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

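Scoring rationale: The paper focuses on the information-flow problems that LLM API calls pose for program analysis. It is highly relevant to 'Large Language Models' (10 points), the core object of study, and to 'Tool Use OR Function Calling OR API Tool Use' (10 points), because it directly studies how LLM APIs are used as tools in code and analyzes information flow through them. Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not covered and receive 0 points.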

!!! tip deepseek-chat TL;DR

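The paper addresses the failure of traditional program analysis at the natural-language/programming-language (NL/PL) boundary that LLM API calls create in programs. It proposes the first information flow analysis method grounded in quantitative information flow theory and validates it through a taxonomy and downstream applications.

Abstract (Chinese translation)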

大语言模型（LLM）API调用正成为一种普遍存在的程序构造，然而它们也创建了一个现有程序分析均无法跨越的边界：运行时值进入自然语言提示，在LLM内部经过不透明的处理，然后重新以代码、SQL、JSON或文本的形式出现，供程序消费。所有跨函数边界追踪数据的分析——包括污点分析、程序切片、依赖分析和变更影响分析——都依赖于对被调用者行为的数据流摘要。LLM调用缺乏此类摘要，导致所有这些分析在我们所称的自然语言/编程语言（NL/PL）边界处失效。
我们提出了首个跨越此边界的信息流方法。该方法基于定量信息流理论，我们的分类法沿着两个正交维度定义了24种标签：信息保留级别（从词汇保留到完全阻断）和输出模态（自然语言、结构化格式、可执行工件）。我们对来自4,154个真实世界Python文件的9,083个占位符-输出对进行了标注，并通过科恩κ系数 κ = 0.82 和近乎完全的覆盖率（0.01%无法分类）验证了其可靠性。我们通过两个下游应用展示了该分类法的实用性：（1）一个结合了基于分类法的过滤与LLM验证的两阶段污点传播流程，在353个专家标注对上达到了 F1 = 0.923 的分数，并在六个真实世界的OpenClaw提示注入案例上进行的跨语言验证进一步确认了其有效性；（2）基于分类法的后向切片在包含非传播占位符的文件中，平均将切片大小减少了15%。按标签分析显示，四个阻断标签几乎涵盖了所有非传播情况，为工具构建者提供了可操作的过滤标准。

Abstract

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen’s κ = 0.82 and near-complete coverage (0.01% unclassifiable). We demonstrate the taxonomy’s utility on two downstream applications: (1) a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves F1 = 0.923 on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2) taxonomy-informed backward slicing reduces slice size by a mean of 15% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.

Keywords: LLM API calls, information flow analysis, NL/PL boundary, taint analysis, program slicing, quantitative information flow, prompt injection, dataflow summaries
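
The abstract above reports labeling reliability as Cohen's κ = 0.82. For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators from its standard definition, (observed agreement − chance agreement) / (1 − chance agreement). The label names and annotations in the example are hypothetical and are not taken from the paper's taxonomy or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement equals 1, i.e. when
    both annotators use a single identical label for every item.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four placeholder-output pairs.
annotator_1 = ["blocked", "blocked", "preserved", "preserved"]
annotator_2 = ["blocked", "preserved", "preserved", "preserved"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa discounts the agreement two annotators would reach by chance given their label frequencies, which is why it is preferred over raw percent agreement when some labels dominate.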

78. ❌ A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis

Authors: Julio C. Serrano, Joonas Kevari, Rumy Narayan Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28336v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

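Scoring rationale: The paper explicitly mentions large language model (LLM) orchestration, so 'Large Language Models' scores 10. Its core is a multi-agent system (12 specialized agents), so 'LLM Agents' and 'Multi-agent Systems' each score 10. It is applied to literature analysis in the social sciences, an application of AI to science, so 'AI for Science' scores 5. Other keywords are not mentioned in the abstract and score 0.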

!!! tip deepseek-chat TL;DR

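The paper presents a multi-agent computational pipeline grounded in Deleuze's rhizome theory (Rhizomatic Research Agent V3) for non-linear literature analysis. By integrating large language model orchestration, dual-source corpus ingestion, and semantic topography analysis, it can surface cross-disciplinary convergences and research gaps that conventional review methods overlook.

Abstract (Chinese translation)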

社会科学领域的系统性文献综述普遍遵循树状逻辑——包括层级式关键词筛选、线性文献筛查和分类学归类——这种模式压制了复杂研究图景所特有的横向关联、断裂与涌现模式。本研究报告提出第三代根茎研究智能体，这是一个基于德勒兹过程关系本体论的多智能体计算流程，旨在通过一个七阶段架构中运作的12个专用智能体，执行非线性文献分析。该系统的开发是对(Narayan2023)所奠定方法论基础的响应，该研究在其可持续能源转型的博士课题中运用了根茎式探究方法，但依赖于研究者驱动的手动探索。根茎研究智能体将根茎的六项原则——连接、异质性、多元性、非意指断裂、制图法与拓印法——转化为自动化流程，整合了大型语言模型（LLM）编排、OpenAlex与arXiv双源语料库摄入、SciBERT语义地形分析以及动态断裂检测协议。初步部署表明，该系统能够揭示传统综述方法系统性忽视的跨学科交汇点与结构性研究空白。该流程为开源系统，可扩展至任何需要非线性知识图谱构建的现象领域。

Abstract

Systematic literature reviews in the social sciences overwhelmingly follow arborescent logics – hierarchical keyword filtering, linear screening, and taxonomic classification – that suppress the lateral connections, ruptures, and emergent patterns characteristic of complex research landscapes. This research note presents the Rhizomatic Research Agent (V3), a multi-agent computational pipeline grounded in Deleuzian process-relational ontology, designed to conduct non-linear literature analysis through 12 specialized agents operating across a seven-phase architecture. The system was developed in response to the methodological groundwork established by (Narayan2023), who employed rhizomatic inquiry in her doctoral research on sustainable energy transitions but relied on manual, researcher-driven exploration. The Rhizomatic Research Agent operationalizes the six principles of the rhizome – connection, heterogeneity, multiplicity, asignifying rupture, cartography, and decalcomania – into an automated pipeline integrating large language model (LLM) orchestration, dual-source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols. Preliminary deployment demonstrates the system’s capacity to surface cross-disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open-source and extensible to any phenomenon zone where non-linear knowledge mapping is required.

Keywords: multi-agent system, large language model, non-linear literature analysis, rhizomatic research, computational pipeline, semantic topography, cross-disciplinary convergence, research gap detection

79. ❌ Integrating Multimodal Large Language Model Knowledge into Amodal Completion

Authors: Heecheol Yun, Eunho Yang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28333v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

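Scoring rationale: The paper's core contribution is AmodalCG, a framework that uses the physical knowledge and reasoning ability of multimodal large language models (MLLMs) to guide amodal completion (reconstructing occluded parts) in images. It is therefore highly relevant to 'Large Language Models' (10), since MLLMs are the multimodal extension of LLMs and the core component of the framework. It has some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 each), because the paper uses MLLMs to reason about the extent and content of the missing regions, which involves multi-step or in-depth reasoning. Other keywords such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, agents, and quantization are not covered by the paper and score 0. The paper is a cross-over application of computer vision and multimodal AI rather than work in a specific scientific field such as bioinformatics, so 'AI for Science' scores 0.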
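
The arithmetic behind the score line can be sketched as follows. The pass threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section. The row formula `weight * relevance` is an assumption, since the report does not state how a row score is derived. Under that assumption, the weight of 0.0 listed in the rows above gives a total of 0.0 whatever the relevance values are.

```python
# Sketch only: the per-keyword formula (weight * relevance, capped) is an assumption.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed row score: weight times relevance (0-10), capped at the maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def passing_score(total_weight: float) -> float:
    """Pass threshold as stated in the configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the assumed row scores for (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# The three non-zero relevance rows of this entry, with the weights as listed (0.0).
rows_as_listed = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0)]
# The same rows under the configured weight of 1.0, for comparison only.
rows_configured = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0)]

print(round(passing_score(27.0), 1))   # threshold for 27 keywords of weight 1.0
print(total_score(rows_as_listed))     # total with the listed weights
print(total_score(rows_configured))    # total if the configured weights applied
```

The comparison suggests that, if the assumed formula is right, this entry would stay below the threshold even with the configured weights, so the listed 0.0 weights change the reported number but not the pass/fail outcome.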

!!! tip deepseek-chat TL;DR

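This paper proposes the AmodalCG framework, which uses the real-world knowledge of multimodal large language models to reason about and guide the completion of occluded object parts in images. Experiments show a marked improvement over existing work.

Abstract translation (Chinese)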

随着自动驾驶与机器人技术的广泛应用，对图像中人物及物体被遮挡部分进行重建的非模态补全任务变得日益关键。正如人类依据先验经验与常识推断隐藏区域，该任务本质上需要关于现实世界实体的物理知识。然而，现有方法要么仅依赖视觉生成模型的图像生成能力（此类模型缺乏此类知识），要么仅在分割阶段利用物理知识，导致其无法显式指导补全过程。为此，我们提出AmodalCG这一新颖框架，该框架利用多模态大语言模型（Multimodal Large Language Models, MLLMs）的现实世界知识来指导非模态补全。我们的框架首先评估遮挡程度，仅在目标物体被严重遮挡时选择性调用MLLM指导。若需指导，框架进一步引入MLLM来推理缺失区域的（1）范围与（2）内容。最后，视觉生成模型整合这些指导信息，并对可能由MLLM指导不准确产生的不完美补全结果进行迭代优化。在多种真实世界图像上的实验结果表明，相较于现有所有方法，本框架取得了显著提升，这提示MLLMs为解决具有挑战性的非模态补全问题提供了一个有前景的方向。

Abstract

With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates these guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

Keywords: Amodal Completion, Multimodal Large Language Models, MLLMs, Visual Generative Models, Occlusion Reasoning, Autonomous Vehicles, Robotics, Real-world Knowledge
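
The control flow described in the abstract (assess occlusion, call the MLLM only for heavy occlusion, generate, then refine) might be sketched as below. This is not the authors' code. Every callable, the `Guidance` type, the 0.5 threshold, and the refinement budget are hypothetical stand-ins, because the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Guidance:
    extent: str   # hypothetical: how far the hidden part of the object extends
    content: str  # hypothetical: what the hidden part should contain


def amodal_complete(
    image: str,
    occlusion_ratio: Callable[[str], float],
    mllm_reason: Callable[[str], Guidance],
    generate: Callable[[str, Optional[Guidance]], str],
    refine: Callable[[str, Optional[Guidance]], str],
    is_acceptable: Callable[[str], bool],
    heavy_threshold: float = 0.5,  # assumed cut-off for "heavily occluded"
    max_rounds: int = 3,           # assumed refinement budget
) -> str:
    """Gate MLLM guidance on occlusion, then generate and iteratively refine."""
    guidance = None
    if occlusion_ratio(image) >= heavy_threshold:
        # Only heavily occluded targets pay the cost of an MLLM call.
        guidance = mllm_reason(image)

    completion = generate(image, guidance)
    for _ in range(max_rounds):
        if is_acceptable(completion):
            break
        # Guidance may be inaccurate, so imperfect results get another pass.
        completion = refine(completion, guidance)
    return completion
```

Passing the stages in as callables keeps the sketch independent of any particular MLLM or image generator.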

80. ❌ Building evidence-based knowledge graphs from full-text literature for disease-specific biomedical reasoning

Authors: Chang Zong, Sicheng Lv, Si-tu Xue, Huilin Zheng, Jian Wan, Lei Zhang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28325v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

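Scoring rationale: The paper's core is using an LLM to build biomedical knowledge graphs, so it is highly relevant to 'Large Language Models' (10). The work is a bioinformatics application and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). The paper explicitly mentions 'retrieval-augmented question answering', which makes it highly relevant to 'Retrieval-Augmented Generation' (10). Other keywords such as MoE, SFT, RLHF, and quantization concern model architecture, training methods, or optimization techniques that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

This paper proposes the EvidenceNet framework, which uses large language models to extract structured evidence from biomedical literature and build disease-specific knowledge graphs that support evidence-based biomedical reasoning and hypothesis generation.

Abstract translation (Chinese)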

生物医学知识资源通常以非结构化文本形式保存证据，或将其压缩为扁平化三元组，从而忽略研究设计、来源和定量支持信息。本文提出EvidenceNet，这是一个从全文生物医学文献中构建疾病特异性知识图谱的框架与数据集。EvidenceNet采用大语言模型（LLM）辅助的流程，将实验性发现提取为结构化的证据节点，对生物医学实体进行标准化，评估证据质量，并通过类型化语义关系连接证据记录。我们发布两项资源：EvidenceNet-HCC（包含7,872条证据记录、10,328个图谱节点和49,756条边）和EvidenceNet-CRC（包含6,622条记录、8,795个节点和39,361条边）。技术验证显示其各组件具有高保真度，包括98.3%的字段级提取准确率、100.0%的高置信度实体链接准确率、87.5%的融合完整性以及90.0%的语义关系类型准确率。在下游评估中，EvidenceNet提升了内部与外部检索增强问答的效果，并保留了可用于未来链接预测和靶点优先级排序的结构化信号。这些结果表明，EvidenceNet可作为支持证据感知的生物医学推理与假设生成的疾病特异性资源。

Abstract

Biomedical knowledge resources often either preserve evidence as unstructured text or compress it into flat triples that omit study design, provenance, and quantitative support. Here we present EvidenceNet, a framework and dataset for building disease-specific knowledge graphs from full-text biomedical literature. EvidenceNet uses a large language model (LLM)-assisted pipeline to extract experimentally grounded findings as structured evidence nodes, normalize biomedical entities, score evidence quality, and connect evidence records through typed semantic relations. We release two resources: EvidenceNet-HCC with 7,872 evidence records, 10,328 graph nodes, and 49,756 edges, and EvidenceNet-CRC with 6,622 records, 8,795 nodes, and 39,361 edges. Technical validation shows high component fidelity, including 98.3% field-level extraction accuracy, 100.0% high-confidence entity-link accuracy, 87.5% fusion integrity, and 90.0% semantic relation-type accuracy. In downstream evaluation, EvidenceNet improves internal and external retrieval-augmented question answering and retains structural signal for future link prediction and target prioritization. These results establish EvidenceNet as a disease-specific resource for evidence-aware biomedical reasoning and hypothesis generation.

Keywords: knowledge graphs, biomedical literature, large language model, evidence extraction, disease-specific, retrieval-augmented question answering, biomedical reasoning
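
The contrast the abstract draws between flat triples and structured evidence nodes can be illustrated with a small data model. This is a minimal sketch, not the EvidenceNet schema. All class and field names are hypothetical, and the sample values are placeholders rather than real findings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """A flat (head, relation, tail) statement with no supporting context."""
    head: str
    relation: str
    tail: str


@dataclass
class EvidenceRecord:
    """A finding plus the context a flat triple omits (field names are illustrative)."""
    record_id: str
    finding: Triple
    study_design: str
    provenance: str
    quantitative_support: str
    quality_score: float


@dataclass
class EvidenceGraph:
    records: dict[str, EvidenceRecord] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_record(self, record: EvidenceRecord) -> None:
        self.records[record.record_id] = record

    def relate(self, source_id: str, relation_type: str, target_id: str) -> None:
        """Connect two existing evidence records with a typed relation."""
        for record_id in (source_id, target_id):
            if record_id not in self.records:
                raise KeyError(f"unknown evidence record: {record_id}")
        self.edges.append((source_id, relation_type, target_id))

    def flatten(self) -> list[Triple]:
        """Collapse to plain triples, discarding design, provenance, and support."""
        return [record.finding for record in self.records.values()]


graph = EvidenceGraph()
graph.add_record(EvidenceRecord("ev1", Triple("GENE_A", "associated_with", "DISEASE_X"),
                                "placeholder design", "placeholder source", "placeholder statistic", 0.9))
graph.add_record(EvidenceRecord("ev2", Triple("GENE_A", "associated_with", "DISEASE_X"),
                                "placeholder design", "placeholder source", "placeholder statistic", 0.4))
graph.relate("ev1", "supports", "ev2")
```

After `flatten()`, the two records become indistinguishable triples, which is the loss of study design and provenance that the abstract describes.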

81. ❌ Mapping data literacy trajectories in K-12 education

Authors: Robert Whyte, Manni Cheung, Katharine Childs, Jane Waite, Sue Sentance Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28317v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

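Scoring rationale: This paper studies pathways for developing data literacy in K-12 education. It belongs to education and computer science education research, and focuses on the paradigm difference between data-driven systems and traditional programming, a framework for classifying learning activities, and the visualization of learning trajectories. It does not touch on large models, the technical principles of deep learning, or AI applications in science. All scoring keywords concern large-model technology, training methods, inference optimization, or AI applications, whereas this paper concentrates on data literacy teaching in basic education, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Through a systematic literature review, this paper proposes the data paradigms framework for classifying data literacy learning activities in K-12 education and visualizes learners' trajectories across different data paradigms, providing a reference for designing data literacy learning environments.

Abstract translation (Chinese)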

数据素养技能是计算机科学教育的基石。然而，理解数据驱动系统如何运作，代表着从传统基于规则的编程范式的一次根本性转变。我们对84项研究进行了系统性文献综述，以探究K-12阶段学习者在跨学科与多元情境下参与数据实践的状况。我们提出了数据范式框架，该框架沿两个维度对学习活动进行分类：(i) 逻辑（基于知识的系统或数据驱动系统），以及(ii) 可解释性（透明模型或黑箱模型）。我们进一步应用学习轨迹的概念，以可视化学习者在这些不同范式间所遵循的路径。我们详细阐述了四种不同的轨迹，以此激发研究人员和教育者反思数据素养的概念如何随学习情境的变化而演变。我们认为这些轨迹可为关注计算机科学教育内外数据素养学习环境设计的相关人士提供有益参考。

Abstract

Data literacy skills are fundamental in computer science education. However, understanding how data-driven systems work represents a paradigm shift from traditional rule-based programming. We conducted a systematic literature review of 84 studies to understand K-12 learners’ engagement with data across disciplines and contexts. We propose the data paradigms framework that categorises learning activities along two dimensions: (i) logic (knowledge-based or data-driven systems), and (ii) explainability (transparent or opaque models). We further apply the notion of learning trajectories to visualize the pathways learners follow across these distinct paradigms. We detail four distinct trajectories as a provocation for researchers and educators to reflect on how the notion of data literacy varies depending on the learning context. We suggest these trajectories could be useful to those concerned with the design of data literacy learning environments within and beyond CS education.

Keywords: data literacy, K-12 education, data-driven systems, learning trajectories, systematic literature review, computer science education, data paradigms framework, explainability

82. ❌ Self++: Co-Determined Agency for Human–AI Symbiosis in Extended Reality

Author: Thammathip Piumsomboon Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

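Scoring rationale: The paper proposes Self++, a design blueprint for XR-AI systems that focuses on interaction design principles such as human-AI collaboration, autonomy, transparency, and adaptivity. The abstract does not mention any specific large-model technique (LLMs, MoE, training methods, inference optimization, and so on) or any AI-for-science application, so none of the keywords is relevant to the paper.

!!! tip deepseek-chat TL;DR

This paper proposes the Self++ design framework, which combines Self-Determination Theory and the Free Energy Principle to address autonomy, transparency, and adaptivity in human-AI collaboration in XR environments, and establishes role-based interaction patterns that augment capability without replacing human judgment.

Abstract translation (Chinese)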

Self++ 是一种扩展现实（XR）中人机共生（human-AI symbiosis）的设计蓝图，它在保留人类作者身份的同时，仍能受益于能力日益增强的智能体（AI agents）。由于 XR 既能塑造感知证据也能塑造行动，表面上“有益”的辅助可能逐渐演变为过度依赖、隐性说服和责任模糊。Self++ 将交互建立在两种互补的理论基础上：自我决定理论（Self-Determination Theory，涉及自主性、胜任力和关联性）和自由能原理（Free Energy Principle，涉及不确定性下的预测稳定性）。它通过共同决定（co-determination）来操作化这些基础，将人与 AI 视为一个耦合系统，该系统必须保持意图和边界的清晰可辨、随时间调整支持力度，并保留用户认可、质疑和否决的权利。这些要求被概括为共同决定三原则（T.A.N.）：透明性（Transparency）、适应性（Adaptivity）和可协商性（Negotiability）。Self++ 将增强功能组织为三个可同时激活的叠加层，分别涵盖感觉运动胜任力支持（Self：胜任力叠加层）、审慎自主性支持（Self+：自主性叠加层）以及社交与长远关联性与目标支持（Self++：关联性与目标叠加层）。在所有叠加层中，它具体规定了九种角色模式（导师、技能构建者、教练；选择架构师、顾问、代理工作者；情境解释者、社交促进者、目标放大器），这些模式可作为交互模式而非固定角色来实现。本研究的贡献在于提供了一份基于角色的设计评估地图，用于设计和评估 XR-AI 系统，使其能在不替代人类判断的前提下提升能力，从而在工作、学习和社交生活中实现共生能动性，并促进人类韧性的发展。

Abstract

Self++ is a design blueprint for human-AI symbiosis in extended reality (XR) that preserves human authorship while still benefiting from increasingly capable AI agents. Because XR can shape both perceptual evidence and action, apparently ‘helpful’ assistance can drift into over-reliance, covert persuasion, and blurred responsibility. Self++ grounds interaction in two complementary theories: Self-Determination Theory (autonomy, competence, relatedness) and the Free Energy Principle (predictive stability under uncertainty). It operationalises these foundations through co-determination, treating the human and the AI as a coupled system that must keep intent and limits legible, tune support over time, and preserve the user’s right to endorse, contest, and override. These requirements are summarised as the co-determination principles (T.A.N.): Transparency, Adaptivity, and Negotiability. Self++ organises augmentation into three concurrently activatable overlays spanning sensorimotor competence support (Self: competence overlay), deliberative autonomy support (Self+: autonomy overlay), and social and long-horizon relatedness and purpose support (Self++: relatedness and purpose overlay). Across the overlays, it specifies nine role patterns (Tutor, Skill Builder, Coach; Choice Architect, Advisor, Agentic Worker; Contextual Interpreter, Social Facilitator, Purpose Amplifier) that can be implemented as interaction patterns, not personas. The contribution is a role-based map for designing and evaluating XR-AI systems that grow capability without replacing judgment, enabling symbiotic agency in work, learning, and social life and resilient human development.

Keywords: human-AI symbiosis, extended reality, self-determination theory, free energy principle, co-determination, transparency adaptivity negotiability, role patterns, symbiotic agency

83. ❌ NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information

Authors: Qing Qing, Huafei Huang, Mingliang Hou, Renqiang Luo, Mohsen Guizani Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28300v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

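Scoring rationale: NeiGAD focuses on graph anomaly detection (GAD) and proposes a new module based on spectral graph analysis to strengthen the modeling of neighbor information. All scoring keywords relate directly to large language models (LLMs), the technical principles of deep learning, or AI applications in science, whereas this paper studies a specific application of graph neural networks (GNNs) to graph anomaly detection. It does not cover LLMs, MoE, scaling laws, training techniques, inference optimization, agents, quantization, or other large-model topics, nor AI for Science subfields such as bioinformatics or cheminformatics. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

To address insufficient modeling of neighbor information in graph anomaly detection, this paper proposes NeiGAD, a plug-and-play module based on spectral graph analysis that selects a compact set of eigenvectors to build discriminative representations. On eight real-world datasets it consistently improves detection accuracy and outperforms existing methods.

Abstract translation (Chinese)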

图异常检测旨在识别属性图中的异常节点或结构。邻居信息既反映结构连通性，也体现与周围节点的属性一致性，对于区分异常与正常模式至关重要。尽管近期基于图神经网络的方法通过消息传递机制整合了此类信息，但往往未能显式建模其作用或与属性的交互，从而限制了检测性能。本研究提出NeiGAD——一种新颖的即插即用模块，通过谱图分析捕捉邻居信息。理论分析表明，邻接矩阵的特征向量编码了局部邻居交互信息，并能够逐级放大异常信号。基于此，NeiGAD选取紧凑的特征向量集合以构建高效且判别性强的表征。在八个真实数据集上的实验表明，NeiGAD能持续提升检测精度，并优于当前最先进的图异常检测方法。这些结果验证了显式邻居建模的重要性以及谱分析在异常检测中的有效性。代码发布于：https://github.com/huafeihuang/NeiGAD。

Abstract

Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)-based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug-and-play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real-world datasets show that NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: https://github.com/huafeihuang/NeiGAD.

Keywords: Graph Anomaly Detection, Spectral Graph Analysis, Neighbor Information, Graph Neural Networks, Eigenvectors, Plug-and-play Module, Attributed Graphs, Anomaly Signals
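
The idea of building node features from adjacency-matrix eigenvectors can be sketched with NumPy as below. This is not the NeiGAD implementation (see the authors' repository for that). The selection rule used here, keeping the `k` eigenvectors with the largest absolute eigenvalue, is an assumption, as is concatenation as the way to combine them with node attributes.

```python
import numpy as np


def spectral_neighbor_features(adjacency: np.ndarray, k: int) -> np.ndarray:
    """Return k eigenvectors of a symmetric adjacency matrix as node features.

    Assumption: "compact set" is taken to mean the k eigenvectors whose
    eigenvalues are largest in absolute value.
    """
    if not np.allclose(adjacency, adjacency.T):
        raise ValueError("expected an undirected graph (symmetric adjacency)")
    # eigh is the eigensolver for symmetric matrices; eigenvectors are its columns.
    eigenvalues, eigenvectors = np.linalg.eigh(adjacency)
    order = np.argsort(-np.abs(eigenvalues))[:k]
    return eigenvectors[:, order]


def augment_attributes(attributes: np.ndarray, adjacency: np.ndarray, k: int) -> np.ndarray:
    """Append the spectral features to each node's attribute vector."""
    spectral = spectral_neighbor_features(adjacency, k)
    return np.concatenate([attributes, spectral], axis=1)


# A 4-node path graph with two arbitrary attributes per node.
path_graph = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
node_attributes = np.ones((4, 2))
augmented = augment_attributes(node_attributes, path_graph, k=2)
print(augmented.shape)
```

The augmented matrix could then be fed to any downstream detector, which matches the plug-and-play framing in the abstract.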

84. ❌ Evaluating LLMs for Answering Student Questions in Introductory Programming Courses

Authors: Thomas Van Mullem, Bart Mesuere, Peter Dawyndt Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

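Scoring rationale: The paper's core is an evaluation of LLMs applied to programming education, so it is highly relevant to 'Large Language Models' (10). It also mentions mitigating hallucination risk (5). It does not cover other specific technical keywords such as MoE, SFT, or RAG, nor AI applications in science.

!!! tip deepseek-chat TL;DR

This study evaluates the ability of large language models to answer student questions in a CS1 programming course. It finds that some models (such as Gemini 3 Flash) can exceed the quality of typical educator responses, and it proposes a teacher-in-the-loop implementation.

Abstract translation (Chinese)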

大型语言模型（LLM）的迅速兴起为编程教育带来了机遇与挑战。尽管学生越来越多地使用生成式人工智能工具，但直接获取答案往往提供的是完整解决方案而非教学提示，这反而会阻碍学习过程。与此同时，教育者在提供及时、个性化反馈方面面临着巨大的工作量和可扩展性挑战。本研究探讨了LLM在CS1编程课程中安全有效地协助教育者回答学生问题的能力。为此，我们通过整理来自学习管理系统的170个真实学生问题构建了一个基准数据集，并配以学科专家撰写的标准答案，从而建立了一个严格、可复现的评估流程。由于传统的文本匹配指标不足以评估开放式教育应答，我们开发并验证了一种定制的“以LLM作为评判者”的指标，该指标针对教学准确性评估进行了优化。我们的研究结果表明，诸如Gemini 3 Flash等模型能够超越典型教育者回答的质量基线，与专家教学标准实现高度契合。为缓解幻觉等持续存在的风险并确保与课程特定语境的一致性，我们主张采用“教师在环”的实施方式。最后，我们将该方法抽象为一个与任务无关的评估框架，主张将教育类LLM工具的研发从临时性的部署后测试，转向可量化的部署前验证流程。

Abstract

The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models, such as Gemini 3 flash, can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a “teacher-in-the-loop” implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.

Keywords: Large Language Models, programming education, student questions, evaluation framework, hallucination mitigation, teacher-in-the-loop, pedagogical accuracy, CS1 course
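
The evaluation loop the abstract outlines (benchmark questions, expert reference answers, an LLM-as-a-Judge metric, and a comparison against an educator baseline) might look roughly like this. It is a sketch under stated assumptions. The judge is an injected callable standing in for an LLM call, the 0 to 1 score scale is assumed, and none of the names come from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class BenchmarkItem:
    question: str
    reference_answer: str  # ground-truth response written by a subject matter expert


# (question, reference answer, candidate answer) -> score in [0, 1]; scale is assumed.
Judge = Callable[[str, str, str], float]


def evaluate(items: Sequence[BenchmarkItem], answer: Callable[[str], str], judge: Judge) -> float:
    """Average judge score of one answer source over the whole benchmark."""
    scores = []
    for item in items:
        candidate = answer(item.question)
        score = judge(item.question, item.reference_answer, candidate)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge returned an out-of-range score: {score}")
        scores.append(score)
    return mean(scores)  # raises StatisticsError on an empty benchmark


def beats_baseline(items: Sequence[BenchmarkItem], model: Callable[[str], str],
                   educator: Callable[[str], str], judge: Judge) -> bool:
    """Pre-deployment check: does the model outscore the educator baseline?"""
    return evaluate(items, model, judge) > evaluate(items, educator, judge)
```

Scoring both the model and the educator baseline with the same judge is what makes the comparison quantifiable before deployment.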

85. ❌ FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks

Author: Gnankan Landry Regis N’guessan Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28288v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

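Scoring rationale: FI-KAN focuses on neural network architecture innovation, proposing a Kolmogorov-Arnold Network variant based on fractal interpolation theory for approximating non-smooth functions. Its core contribution lies in foundational mathematics and neural network design and is unrelated to the vast majority of keywords (large-model technology, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper demonstrates applications in scientific computing (such as PDE solving), but that is not its core focus, so it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

This paper targets the lack of multi-scale decomposition in Kolmogorov-Arnold Networks (KAN) when approximating non-smooth functions and proposes FI-KAN, which incorporates learnable fractal interpolation function bases. It significantly outperforms the original KAN on a Hölder regularity benchmark, fractal targets, and non-smooth PDE solutions, with improvements of up to 79x.

Abstract translation (Chinese)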

Kolmogorov-Arnold网络（KAN）在固定网格上采用B样条基，缺乏对非光滑函数逼近的内在多尺度分解能力。本文引入分形插值KAN（FI-KAN），将来自迭代函数系统（IFS）理论的可学习分形插值函数（FIF）基融入KAN框架。我们提出两种变体：纯FI-KAN（Barnsley, 1986）完全用FIF基替换B样条；混合FI-KAN（Navascues, 2005）保留B样条路径并添加可学习的分形校正项。IFS的收缩参数为每条边赋予可微分的分形维数，使其在训练过程中自适应目标函数的正则性。在Hölder正则性基准测试（$α\in [0.2, 2.0]$）中，混合FI-KAN在所有正则性水平上均优于KAN（提升1.3倍至33倍）。对于分形目标函数，FI-KAN相比KAN实现了高达6.3倍的均方误差降低，并在5 dB信噪比下保持4.7倍优势。在非光滑偏微分方程解（scikit-fem）测试中，混合FI-KAN在粗糙系数扩散问题上取得最高79倍的改进，在L形域角点奇异性问题上达到3.5倍提升。纯FI-KAN表现出互补特性：在粗糙目标上占优而在光滑目标上欠佳，这为“基函数几何结构必须匹配目标正则性”提供了受控证据。分形维数正则化器提供了可解释的复杂度控制机制，其学习值能准确恢复各目标函数的真实分形维数。这些结果确立了正则性匹配的基函数设计可作为神经函数逼近的一种原则性策略。

Abstract

Kolmogorov-Arnold Networks (KAN) employ B-spline bases on a fixed grid, providing no intrinsic multi-scale decomposition for non-smooth function approximation. We introduce Fractal Interpolation KAN (FI-KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI-KAN (Barnsley, 1986) replaces B-splines entirely with FIF bases; Hybrid FI-KAN (Navascues, 2005) retains the B-spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Hölder regularity benchmark ($\alpha \in [0.2, 2.0]$), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining a 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions (scikit-fem), Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities. Pure FI-KAN’s complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity-matched basis design as a principled strategy for neural function approximation.

Keywords: Kolmogorov-Arnold Networks, Fractal Interpolation, Function Approximation, Non-smooth Functions, Iterated Function System, PDE Solutions, Regularity-matched Basis, Neural Networks
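
For readers unfamiliar with fractal interpolation functions, the classical affine construction from IFS theory can be sketched in plain Python. This illustrates the textbook construction, not the FI-KAN layer itself. How the paper parameterizes and trains these bases is not stated in the abstract, and the function names here are hypothetical. The vertical scaling factors `ds` play the role of the contraction parameters and must satisfy |d| < 1. The dimension formula assumes equally spaced, non-collinear interpolation points.

```python
import math
from bisect import bisect_right


def fif_maps(xs, ys, ds):
    """Coefficients (a, e, c, f) of the affine IFS maps, one per interval."""
    x0, xn, y0, yn = xs[0], xs[-1], ys[0], ys[-1]
    span = xn - x0
    maps = []
    for i in range(1, len(xs)):
        d = ds[i - 1]
        a = (xs[i] - xs[i - 1]) / span
        e = (xn * xs[i - 1] - x0 * xs[i]) / span
        c = (ys[i] - ys[i - 1] - d * (yn - y0)) / span
        f = (xn * ys[i - 1] - x0 * ys[i] - d * (xn * y0 - x0 * yn)) / span
        maps.append((a, e, c, f))
    return maps


def fif_eval(x, xs, ys, ds, depth=12):
    """Approximate the FIF at x by unrolling its fixed-point equation `depth` times."""
    maps = fif_maps(xs, ys, ds)

    def value(x, level):
        # Index i of the interval [xs[i-1], xs[i]] containing x.
        i = min(max(bisect_right(xs, x), 1), len(xs) - 1)
        if level == 0:
            # Base case: plain piecewise-linear interpolation.
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])
        a, e, c, f = maps[i - 1]
        u = min(max((x - e) / a, xs[0]), xs[-1])  # pull x back to the full domain
        return c * u + ds[i - 1] * value(u, level - 1) + f

    return value(x, depth)


def box_dimension(ds):
    """Box-counting dimension of the FIF graph for equally spaced points."""
    total = sum(abs(d) for d in ds)
    return 1.0 + math.log(total) / math.log(len(ds)) if total > 1.0 else 1.0


xs, ys = [0.0, 0.5, 1.0], [0.0, 1.0, 0.0]
print(fif_eval(0.25, xs, ys, [0.0, 0.0]))   # d = 0 reduces to linear interpolation
print(fif_eval(0.25, xs, ys, [0.5, 0.5]))   # non-zero d adds self-similar detail
print(box_dimension([0.8, 0.8]))
```

Because the dimension depends smoothly on the scaling factors, it is plausible to regularize or learn it by gradient descent, which is consistent with the abstract's description of a differentiable fractal dimension.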

86. ❌ Pre-Deployment Complexity Estimation for Federated Perception Systems

Authors: KMA Solaiman, Shafkat Islam, Ruy de Oliveira, Bharat Bhargava Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

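Scoring rationale: The paper studies pre-deployment complexity estimation for perception tasks in federated learning systems, focusing on how data properties and client composition relate to federated learning performance. It does not involve large language models, innovations in deep learning techniques, or AI applications in scientific domains, so it has no direct connection to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a pre-deployment complexity estimation framework for federated perception systems that jointly models intrinsic data properties and characteristics of the distributed environment to predict learning difficulty and communication cost. Experiments show that the metric correlates strongly with federated learning performance and communication overhead.

Abstract (Chinese translation)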

边缘人工智能系统日益依赖联邦学习在分布式、隐私保护且资源受限的环境中训练感知模型。然而在训练开始前，从业者往往缺乏实用工具来预估联邦学习任务的难度，包括可达到的准确率与通信成本。本文提出一种与分类器无关的预部署框架，通过联合建模数据的内在特性与分布式环境的特征，以评估联邦感知系统中的学习复杂度。所提出的复杂度度量指标融合了数据维度、稀疏性、异质性等数据集属性，以及参与客户端构成相关的因素。以联邦学习作为代表性分布式训练场景，我们探究了不同联邦配置下学习难度的变化。在MNIST数据集和CIFAR数据集的多种变体上的实验表明，该度量指标与联邦学习性能以及达到既定精度目标所需的通信开销均呈现强相关性。这些发现说明，复杂度评估可作为边缘部署感知系统中资源规划、数据集评估和可行性分析的有效诊断工具。

Abstract

Edge AI systems increasingly rely on federated learning to train perception models in distributed, privacy-preserving, and resource-constrained environments. Yet, before training begins, practitioners often lack practical tools to estimate how difficult a federated learning task will be in terms of achievable accuracy and communication cost. This paper presents a classifier-agnostic, pre-deployment framework for estimating learning complexity in federated perception systems by jointly modeling intrinsic properties of the data and characteristics of the distributed environment. The proposed complexity metric integrates dataset attributes such as dimensionality, sparsity, and heterogeneity with factors related to the composition of participating clients. Using federated learning as a representative distributed training setting, we examine how learning difficulty varies across different federated configurations. Experiments on multiple variants of the MNIST dataset and CIFAR dataset show that the proposed metric strongly correlates with federated learning performance and the communication effort required to reach fixed accuracy targets. These findings suggest that complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.

Keywords: federated learning, edge AI, perception systems, complexity estimation, pre-deployment, communication cost, dataset heterogeneity, distributed environment

87. ❌ Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights

Authors: Eneko Valero, Maria Ribalta i Albado, Oscar Sainz, Naiara Perez, German Rigau Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	15.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

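Scoring rationale: The paper's core contribution is model merging, which is its main novelty, so that keyword receives 15 points. It explicitly studies large language models (LLMs) and instruction tuning, which form its foundation and core application setting, so each receives 10 points. Domain adaptation and fine-tuning (SFT) appear as background and comparison methods, a partial match worth 5 points each. Other keywords such as MoE, SLMs, RAG, and quantization are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how model merging can combine a language-specific base model with an instruction-tuned LLM to transfer instruction-following ability to low-resource languages efficiently, avoiding the need for language-specific instruction data and repeated fine-tuning. Experiments on four Iberian languages show that the approach is effective at low computational cost.

Abstract (Chinese translation)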

大型语言模型（LLMs）的发展仍严重以英语为中心，在低资源语言中表现有限。现有的适应方法（如持续预训练）需要大量计算资源。对于指令微调模型而言，还需高质量指令数据，而这两者在低资源语言社区往往难以获取。在此约束下，模型融合提供了一种轻量级替代方案，但其在低资源语境中的潜力尚未得到系统性探索。本研究探讨是否可通过将指令微调LLM与特定语言基础模型融合，实现语言知识的迁移，从而避免在更强指令变体出现时重复进行语言特定指令数据收集与微调过程。通过对四种伊比利亚语言（巴斯克语、加泰罗尼亚语、加利西亚语和西班牙语）及两个模型系列的实验，我们证明融合技术能使模型在新语言中有效遵循指令，甚至通过组合多个语言特定模型实现多语言能力。研究结果表明，对于低资源语言而言，模型融合是传统适应方法的可行高效替代方案，在显著降低计算成本的同时实现了具有竞争力的性能。

Abstract

Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need of language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.

Keywords: Model Merging, Large Language Models, Instruction Tuning, Low-resource Languages, Multilingual Models, Computational Efficiency, Language Adaptation, Weight Averaging
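
The abstract above describes merging an instruction-tuned LLM with a language-specific base model but does not state the merge operator. As a minimal sketch, assuming a task-arithmetic style merge (one common technique, not necessarily the authors' method), the language-specific change relative to the shared base can be added to the instruction-tuned weights. The function name `merge_language_delta`, the scaling factor `alpha`, and the toy weights are all hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_language_delta(
    instruct: Weights, lang_base: Weights, base: Weights, alpha: float = 1.0
) -> Weights:
    """Return instruct + alpha * (lang_base - base), parameter by parameter.

    Assumes all three checkpoints share the same architecture and that
    lang_base was obtained by continuing to train base on the target language.
    """
    merged: Weights = {}
    for name, w_inst in instruct.items():
        w_lang, w_base = lang_base[name], base[name]
        if not (len(w_inst) == len(w_lang) == len(w_base)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            wi + alpha * (wl - wb) for wi, wl, wb in zip(w_inst, w_lang, w_base)
        ]
    return merged


# Toy checkpoints with a single three-element parameter.
base = {"layer.weight": [0.0, 1.0, 2.0]}
instruct = {"layer.weight": [0.5, 1.0, 2.5]}
lang_base = {"layer.weight": [0.0, 2.0, 1.0]}
merged = merge_language_delta(instruct, lang_base, base, alpha=0.5)
```

Because the merge is pure arithmetic on existing checkpoints, it needs no gradient steps, which is consistent with the low computational cost the abstract reports.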

88. ❌ MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations

Authors: Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

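Scoring rationale: The paper focuses on time series forecasting and proposes a framework (MR-CDM) that combines multi-resolution trend decomposition, an adaptive embedding mechanism, and a multi-scale conditional diffusion process. The scoring keywords all concern large language models (LLMs) or deep learning techniques, whereas this paper addresses time series generation, a conventional deep learning application that does not involve LLMs or related innovations. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since time series forecasting can be applied in scientific fields such as bioinformatics; the paper does not mention these fields explicitly, so it receives 5 points (partial relevance). All other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address fixed-length inputs and inadequate multi-scale modeling in time series forecasting, the paper proposes the MR-CDM framework, which uses multi-resolution trend decomposition and a conditional diffusion process and significantly outperforms existing baselines on several real-world datasets.

Abstract (Chinese translation)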

时间序列预测在众多领域至关重要，但现有模型常受限于固定长度输入与不足的多尺度建模能力。我们提出MR-CDM框架，该框架结合了分层多分辨率趋势分解、针对变长输入的自适应嵌入机制以及多尺度条件扩散过程。在四个真实世界数据集上的评估表明，MR-CDM显著优于当前最先进的基线模型（如CSDI、Informer），将平均绝对误差（MAE）和均方根误差（RMSE）在一定程度上降低了约6%至10%。

Abstract

Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10%.

Keywords: time series generation, multi-resolution modeling, diffusion process, trend decomposition, adaptive embedding, conditional diffusion, forecasting framework, MR-CDM

89. ❌ DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Authors: Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28251v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

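Scoring rationale: The paper's core contribution is the DiffAttn framework, which uses a diffusion model to predict drivers' visual attention and adds an LLM to enhance semantic reasoning. It is therefore highly relevant to 'Large Language Models' (10 points), since the LLM layer is one of its core components. It is partially relevant to 'AI for Science' (5 points), since the work applies AI to intelligent transportation and can be viewed as an application of AI in engineering science. Other keywords (such as MoE, SFT, and RAG) are neither mentioned nor addressed in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the DiffAttn framework, which uses a diffusion model and LLM-enhanced semantic reasoning to predict drivers' visual attention and achieves state-of-the-art performance on four public datasets.

Abstract (Chinese translation)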

驾驶员的视觉注意力为预测潜在危险提供了关键线索，并直接影响其决策与控制操作，注意力缺失可能危及交通安全。为模拟驾驶员的感知模式并推进智能车辆的视觉注意力预测，我们提出DiffAttn——一种基于扩散模型的框架，将该任务建模为条件扩散去噪过程，从而实现对驾驶员注意力更精确的建模。为同时捕捉局部与全局场景特征，我们采用Swin Transformer作为编码器，并设计了一种解码器，该解码器将用于跨层交互的特征融合金字塔与密集多尺度条件扩散相结合，共同增强去噪学习并精细建模局部与全局场景上下文。此外，框架引入了大语言模型层以增强自上而下的语义推理能力，提升对安全关键线索的敏感性。在四个公开数据集上的大量实验表明，DiffAttn取得了最先进的性能，超越了多数基于视频的、自上而下特征驱动的以及大语言模型增强的基线方法。我们的框架进一步支持可解释的以驾驶员为中心的场景理解，并有望提升智能车辆的车内人机交互、风险感知及驾驶员状态监测能力。

Abstract

Drivers’ visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers’ perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers’ attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers’ state measurement in intelligent vehicles.

Keywords: Diffusion-based, Visual Attention Prediction, Large Language Model, Semantic Reasoning, Intelligent Vehicles, Swin Transformer, Feature Fusion Pyramid, State-of-the-art

90. ❌ Reasoning as Energy Minimization over Structured Latent Trajectories

Authors: David K. Johansson Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	7.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

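Scoring rationale: The paper proposes EBRM, a new reasoning method that models reasoning as energy minimization over structured latent trajectories. It is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (8 points), because it directly compares against and improves on CoT methods and introduces a continuous latent representation of multi-step reasoning. It is partially relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (7 points), because the method involves iterative optimization and in-depth reasoning. The paper does not involve large models, deep learning technique principles, or scientific applications, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Energy-Based Reasoning via Structured Latent Planning (EBRM), which targets the lack of iterative refinement in single-shot decoding and the lack of a scalar progress measure in chain-of-thought methods. On a CNF logic task it finds that a distribution mismatch makes latent planning reduce accuracy, analyzes the failure, and proposes dual-path decoder training and latent anchoring to address it.

Abstract (Chinese translation)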

单次推理神经解码器直接输出答案而缺乏迭代优化，而思维链方法虽引入离散中间步骤却缺少衡量推理进展的标量指标。我们提出基于能量的结构化潜规划推理方法（EBRM），该方法将推理建模为在习得能量函数$E(h_x, z)$约束下对多步潜轨迹$z_{1:T}$的梯度优化过程。该能量函数可分解为单步兼容性、转移一致性与轨迹平滑性三项。训练过程结合了监督式编码器-解码器学习与基于困难负样本的对比性能量塑形，推理阶段则对$z$执行梯度下降或朗之万动力学采样，并从$z_T$解码生成结果。
我们发现一个关键失效模式：在CNF逻辑可满足性任务中，潜规划使准确率从约95%下降至约56%。这种性能退化源于分布失配问题——解码器训练时基于编码器输出$h_x$，而评估时却使用规划器输出$z_T$，后者会漂移至未见的潜空间区域。我们通过逐步解码分析、潜漂移追踪和梯度分解研究了该现象。为解决此问题，我们提出双路径解码器训练与潜锚定方法。
我们进一步设计了六部分消融实验框架，涵盖组件贡献度、轨迹长度、规划器动力学机制、初始化策略、解码器训练分布及锚定权重。在三个合成任务上的实验表明：能量值在图结构与逻辑任务中单调递减并诱导出结构化潜轨迹，而在算术任务中保持平稳（相关系数$r=0.073$），这揭示了一项负面结果。代码发布于https://github.com/dkjo8/ebr-via-structured-latent-planning。

Abstract

Single-shot neural decoders commit to answers without iterative refinement, while chain-of-thought methods introduce discrete intermediate steps but lack a scalar measure of reasoning progress. We propose Energy-Based Reasoning via Structured Latent Planning (EBRM), which models reasoning as gradient-based optimization of a multi-step latent trajectory $z_{1:T}$ under a learned energy function $E(h_x, z)$. The energy decomposes into per-step compatibility, transition consistency, and trajectory smoothness terms. Training combines supervised encoder-decoder learning with contrastive energy shaping using hard negatives, while inference performs gradient descent or Langevin dynamics over $z$ and decodes from $z_T$. We identify a critical failure mode: on CNF logic satisfaction, latent planning reduces accuracy from $\approx 95\%$ to $\approx 56\%$. This degradation arises from a distribution mismatch, where the decoder is trained on encoder outputs $h_x$ but evaluated on planner outputs $z_T$ that drift into unseen latent regions. We analyze this behavior through per-step decoding, latent drift tracking, and gradient decomposition. To address it, we propose dual-path decoder training and latent anchoring. We further introduce a six-part ablation protocol covering component contributions, trajectory length, planner dynamics, initialization, decoder training distribution, and anchor weight. Experiments on three synthetic tasks show that energy decreases monotonically and induces structured latent trajectories on graph and logic tasks, while remaining flat on arithmetic ($r = 0.073$), indicating a negative result. Code is available at https://github.com/dkjo8/ebr-via-structured-latent-planning.

Keywords: Energy-Based Reasoning, Structured Latent Planning, Multi-step Reasoning, Chain-of-Thought, Latent Trajectory Optimization, Decoder Distribution Mismatch, Gradient-Based Optimization, Contrastive Energy Shaping
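
The abstract above describes inference as gradient descent on a latent trajectory under an energy with compatibility, transition, and smoothness terms. The toy sketch below illustrates that idea with scalar latents and hand-written quadratic terms. These quadratic forms, the weights `w_trans` and `w_smooth`, and the finite-difference gradient are assumptions for illustration; the paper's energy is learned and its latents are vectors.

```python
from typing import List, Tuple


def energy(h_x: float, z: List[float], w_trans: float = 1.0, w_smooth: float = 0.5) -> float:
    """Toy energy: per-step compatibility + transition consistency + smoothness."""
    compat = sum(0.5 * (z_t - h_x) ** 2 for z_t in z)
    trans = sum(0.5 * (z[t + 1] - z[t]) ** 2 for t in range(len(z) - 1))
    smooth = sum(
        0.5 * (z[t + 1] - 2 * z[t] + z[t - 1]) ** 2 for t in range(1, len(z) - 1)
    )
    return compat + w_trans * trans + w_smooth * smooth


def grad(h_x: float, z: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference gradient of the energy with respect to each z_t."""
    g = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        g.append((energy(h_x, up) - energy(h_x, down)) / (2 * eps))
    return g


def plan(h_x: float, z0: List[float], steps: int = 50, lr: float = 0.05) -> Tuple[List[float], List[float]]:
    """Run plain gradient descent on the trajectory and record the energy trace."""
    z = list(z0)
    trace = [energy(h_x, z)]
    for _ in range(steps):
        g = grad(h_x, z)
        z = [z_t - lr * g_t for z_t, g_t in zip(z, g)]
        trace.append(energy(h_x, z))
    return z, trace


z_final, trace = plan(h_x=1.0, z0=[0.0, 2.0, -1.0, 3.0])
```

The recorded `trace` is the kind of scalar progress measure the abstract contrasts with chain-of-thought: it can be checked for monotone decrease at each step.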

91. ❌ An Optimal Battery-Free Approach for Emission Reduction by Storing Solar Surplus in Building Thermal Mass

Authors: Michela Boffi, Jessica Leoni, Fabrizio Leonforte, Mara Tanelli, Paolo Oliaro Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

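Scoring rationale: The paper studies optimal control strategies for building energy management, using building thermal mass as passive energy storage to reduce carbon emissions. It does not involve large models, deep learning, or AI techniques at all, so it is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes a carbon-aware control strategy that uses building thermal mass as passive storage for surplus solar energy. Simulations show that the approach reduces grid electricity consumption and carbon emissions while maintaining comfort.

Abstract (Chinese translation)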

建筑脱碳需要先进的控制策略，以协调现场可再生能源、电网供电与热需求。现有文献方法通常依赖于需求侧管理策略或主动式储能技术（如蓄电池）。然而，前者往往忽略碳感知目标，并可能导致电网过载问题；而蓄电池则存在环境、寿命周期和成本方面的隐忧。为克服这些局限，我们提出了一种优化的碳感知优化策略，利用建筑围护结构热质量作为被动式储能介质，避免使用专用蓄电池。具体而言，当可再生能源出现盈余时，该策略通过计算在舒适度范围内临时调节室内温度设定值，以确定储存盈余能量的最优比例。因此，通过显式纳入建筑能耗预测、太阳能发电量及随时间变化的电网碳强度数据，本策略在维持舒适度的同时实现了碳排放感知的负荷转移。我们通过模拟同一系统三种不同热质量的TRNSYS模型对该方法进行评估。在所有案例中，相较于未利用可再生能源盈余的基准方案，结果显示电网用电量均实现持续降低。这些发现凸显了基于热质量的控制策略在建筑脱碳领域的潜力。

Abstract

Decarbonization in buildings calls for advanced control strategies that coordinate on-site renewables, grid electricity, and thermal demand. Literature approaches typically rely on demand side management strategies or on active energy storage, like batteries. However, the first solution often neglects carbon-aware objectives, and could lead to grid overload issues, while batteries entail environmental, end-of-life, and cost concerns. To overcome these limitations, we propose an optimal, carbon-aware optimization strategy that exploits the building’s thermal mass as a passive storage, avoiding dedicated batteries. Specifically, when a surplus of renewable energy is available, our strategy computes the optimal share of surplus to store by temporarily adjusting the indoor temperature setpoint within comfort bounds. Thus, by explicitly accounting for forecasts of building energy consumption, solar production, and time-varying grid carbon intensity, our strategy enables emissions-aware load shifting while maintaining comfort. We evaluate the approach by simulating three TRNSYS models of the same system with different thermal mass. In all cases, the results show consistent reductions in grid electricity consumption with respect to a baseline that does not leverage surplus renewable generation. These findings highlight the potential of thermal-mass-based control for building decarbonization.

Keywords: building decarbonization, thermal mass storage, carbon-aware optimization, renewable energy surplus, grid electricity reduction, passive energy storage, indoor temperature control, TRNSYS simulation

92. ❌ TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation

Authors: Minh-Khoi Do, Huy Che, Dinh-Duy Phan, Duc-Khai Lam, Duc-Lung Vu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

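Scoring rationale: The paper focuses on TwinMixing, a multi-task segmentation model in computer vision for drivable-area and lane segmentation in autonomous driving. Its core contributions are a lightweight network architecture, the Efficient Pyramid Mixing module, and the Dual-Branch Upsampling block, all centered on convolutional neural networks, feature extraction, and segmentation. The scoring keywords all concern large language models, deep learning technique principles, or AI for science, none of which this paper addresses. It does not discuss large models, language models, training methods, alignment, inference optimization, agent systems, or AI for science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes TwinMixing, a lightweight multi-task segmentation model for drivable-area and lane segmentation in autonomous driving. Its Efficient Pyramid Mixing module and Dual-Branch Upsampling block balance high accuracy against low computational cost on the BDD100K dataset.

Abstract (Chinese translation)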

精确高效的感知是自动驾驶的关键，其中可行驶区域与车道线分割等任务为运动规划与控制提供了核心依据。然而，在低成本硬件上实现高分割精度的同时保持实时性能仍是一个具有挑战性的问题。为解决此问题，我们提出了TwinMixing，一种专为可行驶区域与车道线分割设计的轻量级多任务分割模型。该网络采用共享编码器与任务专用解码器的架构，实现了特征共享与任务特化的平衡。在编码器中，我们提出了高效金字塔混合（Efficient Pyramid Mixing, EPM）模块，该模块通过组合分组卷积、深度可分离空洞卷积与通道混洗操作来增强多尺度特征提取能力，在有效扩大感受野的同时最小化计算开销。每个解码器采用双分支上采样（Dual-Branch Upsampling, DBU）块，该块由基于可学习转置卷积的精细细节分支与基于无参数双线性插值的粗粒度分支构成，实现了细节丰富且空间一致的特征重建。在BDD100K数据集上进行的大量实验验证了TwinMixing在三种配置（微型、基础、大型）下的有效性。其中，基础配置在精度与计算效率间取得了最佳平衡，仅用0.43M参数量和3.95 GFLOPs的计算量，即可在可行驶区域分割上达到92.0%的平均交并比（mIoU），在车道线分割上达到32.3%的交并比（IoU）。此外，如图1所示，TwinMixing在相同任务上持续优于现有的分割模型。得益于其紧凑且模块化的设计，TwinMixing在自动驾驶及嵌入式感知系统中展现出强大的实时部署潜力。源代码地址：https://github.com/Jun0se7en/TwinMixing。

Abstract

Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution-based Fine detailed branch and a parameter-free bilinear interpolation-based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations - tiny, base, and large. Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code: https://github.com/Jun0se7en/TwinMixing.

Keywords: multi-task segmentation, autonomous driving, lightweight model, drivable-area segmentation, lane segmentation, Efficient Pyramid Mixing, Dual-Branch Upsampling, real-time perception
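
The abstract above lists channel shuffle as one ingredient of the EPM module. As a minimal sketch of that operation alone, assuming the usual ShuffleNet-style definition (reshape the channels into groups, transpose, flatten), the code below interleaves channels so that later grouped convolutions see features from every group. The function name and the list-of-labels representation are illustrative; the abstract does not give TwinMixing's exact ordering.

```python
from typing import List, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: List[T], groups: int) -> List[T]:
    """Interleave channels across groups: view as (groups, per_group), transpose, flatten."""
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + k]  # k-th channel of group g
        for k in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

The operation is a fixed permutation with no parameters, which fits the abstract's emphasis on keeping computational cost low.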

93. ❌ ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

Authors: Song Yu, Li Li Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

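Scoring rationale: The paper proposes ERPO, a reinforcement learning optimization method aimed at large reasoning models (a category of large language models). Its core contribution is improved token-level policy optimization during reasoning. It is highly relevant to 'RLHF/DPO' (10 points), because it directly studies reinforcement learning optimization; highly relevant to 'Chain of Thought / Multi-step Reasoning' and 'System 2 Thinking / In-depth Reasoning' (10 points each), because it focuses on complex reasoning tasks such as mathematical reasoning; and highly relevant to 'Large Language Models' (10 points), because its subject is large reasoning models. Other keywords such as MoE, quantization, and RAG are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the coarse, sequence-level credit assignment that reinforcement learning applies to every token when optimizing large reasoning models, the paper proposes Entropy-Regulated Policy Optimization (ERPO), which significantly improves accuracy on mathematical reasoning tasks and yields more concise reasoning paths.

Abstract (Chinese translation)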

基于可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。然而，标准的组相对策略优化（GRPO）通常为所有令牌分配统一的序列级优势值，从而忽视了推理链中固有的信息异质性。我们证明，这种粗粒度的信用分配会导致过早的熵崩溃，并鼓励模型生成冗余、低质量的推理路径。通过系统的实证分析，我们识别出关键决策支点（Critical Decision Pivots, CDPs）：这些是短暂的高熵状态，在此处策略轨迹对扰动最为敏感。这些支点代表了推理道路上的“分岔口”，有效的多路径探索在此处最为关键，却常被统一的优势信号所抑制。基于这些发现，我们提出了熵调控策略优化（Entropy-Regulated Policy Optimization, ERPO），它将优化焦点从粗粒度的序列转移到细粒度的令牌动态上。ERPO引入了三个协同组件：（i）熵感知门控，自适应地放大CDPs处的探索以促进多样化路径发现；（ii）基于桶的隐式归一化，通过对齐令牌进度窗口来缓解难度偏差；（iii）结果锚定的优势合成，通过结果驱动的锚点重新加权令牌级信号。在竞争性数学基准测试（如MATH、AIME）上的大量实验表明，ERPO显著优于GRPO。值得注意的是，ERPO不仅提高了推理准确性，还产生了显著更简洁、更稳健的推导路径，为大型推理模型建立了新的效率-准确性前沿。

Abstract

Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy’s trajectory is most sensitive to perturbations. These pivots represent the “forks in the road” where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks (e.g., MATH, AIME) demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, establishing a new efficiency-accuracy frontier for large reasoning models.
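
The contrast drawn in the abstract can be sketched in a few lines. The block below is a minimal illustration, not ERPO itself: `grpo_advantages` uses the usual group-relative normalization (reward minus group mean, divided by group standard deviation), which gives every token in a response the same advantage. The entropy gate, including the names `threshold` and `boost`, is a hypothetical stand-in for the paper's Entropy-aware Gating, whose exact form the abstract does not give.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: one scalar per sampled response."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_token_advantages(seq_advantage, token_dists, threshold=0.5, boost=2.0):
    """Hypothetical gate: scale the shared advantage at high-entropy tokens.

    Assumption for illustration only; the paper's gating rule may differ.
    """
    return [
        seq_advantage * (boost if token_entropy(d) > threshold else 1.0)
        for d in token_dists
    ]

# Four sampled answers to one prompt, rewarded 1 if verified correct.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = grpo_advantages(rewards)

# Two tokens of the first answer: a confident step and a near-coin-flip step.
dists = [[0.99, 0.01], [0.5, 0.5]]
gated = gated_token_advantages(advs[0], dists)
```

Under plain GRPO both tokens would receive `advs[0]`; with the illustrative gate only the high-entropy token is up-weighted.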

Keywords: Entropy-Regulated Policy Optimization, Large Reasoning Models, Reinforcement Learning, Token-Level Optimization, Mathematical Reasoning, Critical Decision Pivots, GRPO, Reasoning Chains

94. ❌ EpiPersona: Persona Projection and Episode Coupling for Pluralistic Preference Modeling

Authors: Yujie Zhang, Weikang Yuan, Zhuoren Jiang, Pengwei Yan Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

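Scoring rationale: The paper studies pluralistic alignment of LLMs, which falls under alignment techniques, so it is highly relevant to 'Large Language Models' and 'Alignment' (10 points each). It works with preference feedback, which gives it some connection to 'RLHF/DPO' (5 points). Other keywords such as MoE, SFT, RAG, and CoT are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes EpiPersona, a framework that models pluralistic preferences by explicitly coupling personas with episodes. It addresses the tendency of existing methods to conflate stable personal traits with episode-specific factors, and it reports consistent gains over baselines, notably under hard episodic shifts, while remaining effective with sparse preference data.

Abstract (Chinese translation)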

多元对齐对于使大语言模型适应个体及少数群体的多样化偏好至关重要。然而，现有方法常将稳定的个人特质与情境特定因素相混淆，限制了其在跨情境中的泛化能力。为应对这一挑战，我们提出了EpiPersona框架，用于实现显式的人物-情境耦合。EpiPersona首先将含噪声的偏好反馈映射到低维人物空间，其中相似的人物特征被聚合为共享的离散编码。这一过程在不依赖预定义偏好维度的前提下，将持久性个人特征与情境信号分离开来。随后，推断出的人物表征与当前情境相耦合，实现情境感知的偏好预测。大量实验表明，EpiPersona始终优于基线方法。它在困难的情境转换场景中取得了显著的性能提升，同时在稀疏偏好数据下仍保持有效性。

Abstract

Pluralistic alignment is essential for adapting large language models (LLMs) to the diverse preferences of individuals and minority groups. However, existing approaches often mix stable personal traits with episode-specific factors, limiting their ability to generalize across episodes. To address this challenge, we introduce EpiPersona, a framework for explicit persona-episode coupling. EpiPersona first projects noisy preference feedback into a low-dimensional persona space, where similar personas are aggregated into shared discrete codes. This process separates enduring personal characteristics from situational signals without relying on predefined preference dimensions. The inferred persona representation is then coupled with the current episode, enabling episode-aware preference prediction. Extensive experiments show that EpiPersona consistently outperforms the baselines. It achieves notable performance gains in hard episodic-shift scenarios, while remaining effective with sparse preference data.
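
The projection-then-discretization step described in the abstract resembles a nearest-codebook lookup. The sketch below illustrates that idea only, under loud assumptions: the projection matrix, the codebook, and the feedback vectors are hand-picked hypothetical values, whereas the paper would learn them, and the function names are not taken from the paper.

```python
def project(feedback, matrix):
    """Linear projection of a feedback vector into a low-dimensional space."""
    return [sum(w * x for w, x in zip(row, feedback)) for row in matrix]

def nearest_code(z, codebook):
    """Index of the codebook entry closest to z (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )

# Hypothetical 4-d preference feedback projected to a 2-d persona space.
projection = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
codebook = [[1.0, 0.0], [0.0, 1.0]]  # two shared discrete persona codes

user_a = [0.9, 1.0, 0.1, 0.0]
user_b = [1.0, 0.8, 0.0, 0.2]   # noisy variant of the same persona
user_c = [0.1, 0.0, 0.9, 1.0]   # a different persona

codes = [nearest_code(project(u, projection), codebook)
         for u in (user_a, user_b, user_c)]
```

Users with similar but noisy feedback land on the same code, which is the aggregation effect the abstract describes; coupling the code with the current episode would be a further step not shown here.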

Keywords: pluralistic alignment, persona projection, episode coupling, preference modeling, LLMs, preference feedback, episodic-shift, sparse data

95. ❌ Differentiable Power-Flow Optimization

Authors: Muhammed Öz, Jasmin Hörter, Kaleb Phipps, Charlotte Debus, Achim Streit, Markus Götz Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

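Scoring rationale: The paper 'Differentiable Power-Flow Optimization' targets power-system simulation and optimization. It proposes a differentiable AC power-flow method (DPF) that improves computational efficiency through GPU acceleration, sparse tensor representations, and batching. Its core is the application of modern machine-learning frameworks (such as PyTorch) to computational bottlenecks in power engineering, which makes it an application of AI to science and engineering. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is partially relevant (5 points): the paper applies machine-learning tooling to energy science but does not involve large models, new deep-learning techniques, or other specific AI methods (such as LLMs, MoE, or fine-tuning). All other keywords are unrelated (0 points); the paper does not discuss language models, reasoning techniques, alignment, compression, or similar topics.

!!! tip deepseek-chat TL;DR

To address the growing complexity of grid management caused by renewable-energy integration, the paper proposes Differentiable Power-Flow (DPF), a differentiable AC power-flow simulation that improves computational efficiency through GPU acceleration and sparse tensor representations and is suited to time-series analyses, N-1 contingency analyses, and screening.

Abstract (Chinese translation)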

随着可再生能源的兴起及其发电的高度波动性，电网管理变得日益复杂且计算要求极高。传统的交流潮流仿真采用牛顿-拉夫逊法，其可扩展性较差，难以适用于输配电网联合建模与全球电网分析等新兴应用场景。与此同时，纯数据驱动的代理模型缺乏物理保证，可能违反基本约束条件。本研究提出可微分潮流模型，将交流潮流问题重构为可微分仿真形式。该模型实现了从物理功率失配到底层仿真参数的端到端梯度传播，从而能够基于梯度优化方法高效识别这些参数。我们通过利用现代机器学习框架（如PyTorch）中的GPU加速、稀疏张量表示和批量处理能力，证明可微分潮流模型可作为牛顿-拉夫逊法的高可扩展性替代方案。该模型特别适用于以下场景：因其高效复用历史解的特性而适用于时序分析；凭借批量处理能力适用于N-1故障分析；利用其高速计算与提前终止特性可作为筛选工具。相关代码已在作者代码库中开源。

Abstract

With the rise of renewable energy sources and their high variability in generation, the management of power grids becomes increasingly complex and computationally demanding. Conventional AC-power-flow simulations, which use the Newton-Raphson (NR) method, suffer from poor scalability, making them impractical for emerging use cases such as joint transmission-distribution modeling and global grid analysis. At the same time, purely data-driven surrogate models lack physical guarantees and may violate fundamental constraints. In this work, we propose Differentiable Power-Flow (DPF), a reformulation of the AC power-flow problem as a differentiable simulation. DPF enables end-to-end gradient propagation from the physical power mismatches to the underlying simulation parameters, thereby allowing these parameters to be identified efficiently using gradient-based optimization. We demonstrate that DPF provides a scalable alternative to NR by leveraging GPU acceleration, sparse tensor representations, and batching capabilities available in modern machine-learning frameworks such as PyTorch. DPF is especially suited as a tool for time-series analyses due to its efficient reuse of previous solutions, for N-1 contingency-analyses due to its ability to process cases in batches, and as a screening tool by leveraging its speed and early stopping capability. The code is available in the authors’ code repository.
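
The idea of driving the power mismatch to zero with gradients, rather than with Newton-Raphson, can be illustrated on a toy case. The sketch below is not the authors' DPF code: it assumes a lossless two-bus system (slack bus at 1.0 p.u., one load bus), uses plain Python instead of PyTorch, and writes out by hand the partial derivatives that an autodiff framework would supply. The step size `lr`, the susceptance `b`, and the load values are illustrative choices.

```python
import math

def injections(v, theta, b):
    """Active and reactive power injected at bus 2 of a lossless two-bus system.

    Bus 1 is the slack bus (1.0 p.u., angle 0); b = 1/x is the line susceptance.
    """
    p = b * v * math.sin(theta)
    q = b * v * v - b * v * math.cos(theta)
    return p, q

def solve_by_gradient_descent(p_spec, q_spec, b=10.0, lr=0.004, steps=500):
    """Minimize the squared power mismatch over (v, theta) by gradient descent."""
    v, theta = 1.0, 0.0  # flat start
    for _ in range(steps):
        p, q = injections(v, theta, b)
        rp, rq = p - p_spec, q - q_spec  # power mismatches
        # Partial derivatives of p and q, derived by hand for this toy model.
        dp_dv, dp_dth = b * math.sin(theta), b * v * math.cos(theta)
        dq_dv, dq_dth = 2 * b * v - b * math.cos(theta), b * v * math.sin(theta)
        # Gradient of the loss rp**2 + rq**2 with respect to v and theta.
        g_v = 2 * (rp * dp_dv + rq * dq_dv)
        g_th = 2 * (rp * dp_dth + rq * dq_dth)
        v -= lr * g_v
        theta -= lr * g_th
    return v, theta

# Load bus drawing 0.5 p.u. active and 0.2 p.u. reactive power.
v, theta = solve_by_gradient_descent(p_spec=-0.5, q_spec=-0.2)
p, q = injections(v, theta, 10.0)
```

Because each update needs only the current state and a gradient, a previous solution can serve as the starting point for the next time step, and many cases can be stacked into a batch, which is the kind of reuse and batching the abstract highlights.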

Keywords: Differentiable Power-Flow, AC power-flow simulation, GPU acceleration, sparse tensor representations, Newton-Raphson method, renewable energy, gradient-based optimization, PyTorch

96. ❌ Designing AI for Real Users – Accessibility Gaps in Retail AI Front-End

Authors: Neha Puri, Tim Dixon Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28196v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

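Scoring rationale: The paper focuses on the accessibility ethics of AI front-end design, in particular how retail AI systems (virtual assistants, virtual try-on, personalized recommendations) exclude users with disabilities and users with differing digital literacy. It belongs to AI ethics, human-computer interaction, and design research rather than to large-model principles, training methods, inference optimization, or scientific applications. Every scoring keyword refers to a specific component, method, or application in the large-model technology stack and is unrelated to the paper's topic, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper argues that retail AI front-ends implicitly assume an 'ideal user', which marginalizes users with vision, hearing, motor, cognitive, and other differences as well as users with varying digital literacy, and it proposes front-end assurance as a practical complement to AI governance.

Abstract (Chinese translation)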

随着人工智能日益融入面向客户的系统，伦理审查主要聚焦于模型、数据与治理层面，而对用户界面设计如何塑造人工智能体验的关注则远远不足。本文指出，许多人工智能前端系统隐含地预设了“理想用户的身心状态”，当通过残障用户的使用体验进行审视时，这种预设便显现出来并产生伦理影响。我们通过零售业用于客户互动的人工智能前端系统——即虚拟助手、虚拟试穿系统和超个性化推荐系统——来探讨这一问题。尽管这些系统常以直观和包容为设计理念，其交互逻辑中仍内嵌着对视觉、听觉、运动、认知、言语及感知能力存在差异的用户，以及数字素养和交互规范存在年龄相关变化的用户的边缘化假设。基于实践导向的洞察，我们认为这些缺陷的持续存在主要并非由于技术限制，而是源于人工智能前端系统设计与部署所处的商业、组织和采购环境——在这些环境中，可访问性要求很少被纳入合同条款。我们提出将前端保障作为人工智能治理的实践补充，使智能性与多模态的宣称与真实用户的多样性相契合。

Abstract

As AI becomes embedded in customer-facing systems, ethical scrutiny has largely focused on models, data, and governance. Far less attention has been paid to how AI is experienced through user-facing design. This commentary argues that many AI front-ends implicitly assume an ‘ideal user body and mind’, and that this becomes visible and ethically consequential when examined through the experiences of differently abled users. We explore this through retail AI front-ends for customer engagement - i.e., virtual assistants, virtual try-on systems, and hyper-personalised recommendations. Despite intuitive and inclusive framing, these systems embed interaction assumptions that marginalise users with vision, hearing, motor, cognitive, speech and sensory differences, as well as age-related variation in digital literacy and interaction norms. Drawing on practice-led insights, we argue that these failures persist not primarily due to technical limits, but due to the commercial, organisational, and procurement contexts in which AI front-ends are designed and deployed, where accessibility is rarely contractual. We propose front-end assurance as a practical complement to AI governance, aligning claims of intelligence and multimodality with the diversity of real users.

Keywords: AI front-end design, accessibility, retail AI, user experience, ethical AI, inclusive design, virtual assistants, personalized recommendations

97. ❌ Skillful Kilometer-Scale Regional Weather Forecasting via Global and Regional Coupling

Authors: Weiqi Chen, Wenwei Wang, Qilong Yuan, Lefei Shen, Bingqing Peng, Jiawei Chen, Bo Wu, Liang Sun Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

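Scoring rationale: The paper uses a Transformer architecture for high-resolution regional weather forecasting, which belongs to AI for Science, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It mentions a 'pretrained Transformer-based global model', which touches on pre-training and gives some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), although the core of the paper is the forecasting application rather than pre-training itself. The remaining keywords concern the principles, optimization, alignment, reasoning, and agents of large language models (LLMs); this paper studies a Transformer-based weather model, not a language model, so they are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a global-regional coupling framework that combines a pretrained Transformer-based global model with a high-resolution regional network through the ScaleMixer module, producing forecasts over China at 0.05° (about 5 km) resolution that significantly outperform operational numerical weather prediction and AI baselines.

Abstract (Chinese translation)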

数据驱动的天气模型已推动全球中期预报发展，但由于大尺度动力学与地形诱发环流、海岸效应等小尺度过程之间存在未解决的多尺度相互作用，高分辨率区域预测仍具挑战。本文提出一种用于公里级区域天气预报的全球-区域耦合框架，该框架通过新型双向耦合模块ScaleMixer，将预训练的基于Transformer的全球模型与高分辨率区域网络协同耦合。ScaleMixer通过自适应关键位置采样动态识别气象关键区域，并借助专用注意力机制实现跨尺度特征交互。该框架以0.05度（约5公里）和1小时分辨率生成中国区域预报，在网格化再分析数据和实时气象站观测数据上均显著优于业务数值天气预报（NWP）与人工智能基线模型。其在捕捉地形风场模式和焚风（Foehn）增温等精细化现象方面表现出卓越能力，实现了高分辨率精度与全球尺度一致性的有效统一。代码发布于https://anonymous.4open.science/r/ScaleMixer-6B66。

Abstract

Data-driven weather models have advanced global medium-range forecasting, yet high-resolution regional prediction remains challenging due to unresolved multiscale interactions between large-scale dynamics and small-scale processes such as terrain-induced circulations and coastal effects. This paper presents a global-regional coupling framework for kilometer-scale regional weather forecasting that synergistically couples a pretrained Transformer-based global model with a high-resolution regional network via a novel bidirectional coupling module, ScaleMixer. ScaleMixer dynamically identifies meteorologically critical regions through adaptive key-position sampling and enables cross-scale feature interaction through dedicated attention mechanisms. The framework produces forecasts at $0.05^\circ$ ($\sim 5 \mathrm{km}$ ) and 1-hour resolution over China, significantly outperforming operational NWP and AI baselines on both gridded reanalysis data and real-time weather station observations. It exhibits exceptional skill in capturing fine-grained phenomena such as orographic wind patterns and Foehn warming, demonstrating effective global-scale coherence with high-resolution fidelity. The code is available at https://anonymous.4open.science/r/ScaleMixer-6B66.

Keywords: kilometer-scale regional weather forecasting, global-regional coupling framework, Transformer-based global model, ScaleMixer, high-resolution regional network, data-driven weather models, AI for science, meteorological prediction

98. ❌ PReD: An LLM-based Foundation Multimodal Model for Electromagnetic Perception, Recognition, and Decision

Authors: Zehua Han, Jing Xiao, Yiqi Duan, Mengyu Xiang, Yuheng Ji, Xiaolong Zheng, Chenghanyu Zhang, Zhendong She, Junyu Shen, Dingwei Tan, Shichu Sun, Zhou Cong, Mingxuan Liu, Fengxiang Wang, Jinping Sun, Yangang Sun Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

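Scoring rationale: The paper proposes PReD, an LLM-based foundation multimodal model for the electromagnetic domain. It centrally involves LLMs / Foundation Models (10 points), an AI for Science application (10 points), and Pre-training / Domain Adaptation (10 points). It builds the high-quality dataset PReD-1.3M, which touches on data quality (5 points), and uses a multi-stage training strategy that includes fine-tuning (5 points). The model goes from signal understanding to language-driven reasoning and decision-making, which involves reasoning ability (Chain of Thought and System 2 Thinking, 5 points each). Other keywords such as MoE, SLMs, Alignment, and RAG are not mentioned in the abstract or are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address data scarcity and insufficient integration of domain knowledge in the electromagnetic domain, the paper proposes PReD, described as the first foundation model covering the closed loop of 'perception, recognition, decision-making'. By building a high-quality dataset and using a multi-stage training strategy, it achieves state-of-the-art performance on electromagnetic signal understanding and reasoning tasks.

Abstract (Chinese translation)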

多模态大语言模型在通用领域已展现出强大的跨模态理解与推理能力。然而，在电磁领域，它们仍面临数据稀缺与领域知识融合不足等挑战。本文提出了PReD，这是首个覆盖“感知、识别、决策”智能闭环的电磁领域基础模型。我们构建了高质量多任务电磁数据集PReD-1.3M及评估基准PReD-Bench。该数据集涵盖原始时域波形、频域谱图与星座图等多视角表征，覆盖通信与雷达信号的典型特征，支持信号检测、调制识别、参数估计、协议识别、射频指纹识别及抗干扰决策等一系列核心任务。PReD采用统一电磁信号多任务的多阶段训练策略，实现了从端到端信号理解到语言驱动推理与决策的闭环优化，在保持通用多模态能力的同时显著增强了电磁领域专业性。实验结果表明，PReD在基于开源与自采集信号数据集构建的PReD-Bench上均取得了最先进的性能。这些结果共同验证了视觉对齐基础模型在推进电磁信号理解与推理方面的可行性与潜力。

Abstract

Multimodal Large Language Models have demonstrated powerful cross-modal understanding and reasoning capabilities in general domains. However, in the electromagnetic (EM) domain, they still face challenges such as data scarcity and insufficient integration of domain knowledge. This paper proposes PReD, the first foundation model for the EM domain that covers the intelligent closed-loop of “perception, recognition, decision-making.” We constructed a high-quality multitask EM dataset, PReD-1.3M, and an evaluation benchmark, PReD-Bench. The dataset encompasses multi-perspective representations such as raw time-domain waveform, frequency-domain spectrograms, and constellation diagrams, covering typical features of communication and radar signals. It supports a range of core tasks, including signal detection, modulation recognition, parameter estimation, protocol recognition, radio frequency fingerprint recognition, and anti-jamming decision-making. PReD adopts a multi-stage training strategy that unifies multiple tasks for EM signals. It achieves closed-loop optimization from end-to-end signal understanding to language-driven reasoning and decision-making, significantly enhancing EM domain expertise while maintaining general multimodal capabilities. Experimental results show that PReD achieves state-of-the-art performance on PReD-Bench constructed from both open-source and self-collected signal datasets. These results collectively validate the feasibility and potential of vision-aligned foundation models in advancing the understanding and reasoning of EM signals.

Keywords: Multimodal Large Language Models, Electromagnetic Domain, Foundation Model, Perception-Recognition-Decision, PReD Dataset, Signal Understanding, Cross-modal Reasoning, AI for Science

99. ❌ Evaluating Privilege Usage of Agents on Real-World Tools

Authors: Quan Zhang, Lianhang Fu, Lvsi Lian, Gwihwan Go, Yujue Wang, Chijin Zhou, Yu Jiang, Geguang Pu Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

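Scoring rationale: The paper studies privilege security when LLM agents use real-world tools. It is highly relevant to 'Large Language Models', 'LLM Agents', and 'Tool Use' (10 points each), since these are the core objects and setting of the study. Other keywords such as MoE, quantization, inference acceleration, and AI for Science are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies privilege security when LLM agents use real-world tools and proposes GrantBox, a security evaluation sandbox. It finds that although LLMs show basic security awareness, the average attack success rate still reaches 84.80% in carefully crafted attack scenarios.

Abstract (Chinese translation)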

为LLM智能体配备现实世界工具可显著提升生产力。然而，赋予智能体工具使用的自主权也意味着将相应权限同时授予智能体及其底层大语言模型。不当的权限使用可能导致严重后果，包括信息泄露与基础设施破坏。尽管已有若干基准测试被构建用于研究智能体的安全性，但它们通常依赖预编码工具和受限的交互模式。此类人工构建的环境与现实世界存在显著差异，难以评估智能体在关键权限控制与使用方面的安全能力。为此，我们提出GrantBox——一个用于分析智能体权限使用的安全评估沙箱。GrantBox能够自动集成现实世界工具，并允许LLM智能体调用真实权限，从而实现在提示注入攻击下的权限使用评估。我们的研究结果表明：尽管大语言模型展现出基础的安全意识并能阻断部分直接攻击，它们仍易受更复杂攻击的影响，在精心设计的场景中平均攻击成功率高达84.80%。

Abstract

Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents’ security, they often rely on pre-coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real-world, making it hard to assess agents’ security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real-world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.

Keywords: LLM agents, tool use, privilege security, GrantBox, prompt injection attacks, real-world tools, security evaluation, autonomous agents

100. ❌ RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation

Authors: Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro Journal/source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28142v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

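Scoring rationale: The paper proposes RecycleLoRA for domain generalized semantic segmentation with Vision Foundation Models (VFMs). Its core innovation is a dual-LoRA subspace adapter design based on Rank-Revealing QR decomposition, intended to increase LoRA's representational richness and parameter efficiency. It is therefore highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), since LoRA is the core method; fairly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points), since it involves adapting across domains and reusing pre-trained models; and weakly relevant to 'Large Language Models OR LLMs OR Foundation Models' (5 points), since VFMs are foundation models but not language models. Other keywords such as MoE, SLMs, SFT, and RAG are unrelated to the paper's computer-vision and LoRA-optimization topic and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes RecycleLoRA, which uses dual LoRA subspace adapters based on Rank-Revealing QR decomposition to address limited representational diversity and inefficient parameter utilization in LoRA for domain generalized semantic segmentation, achieving state-of-the-art performance on synthetic-to-real and real-to-real generalization tasks.

Abstract (Chinese translation)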

领域泛化语义分割（DGSS）旨在在未见过的目标域上保持鲁棒性能。视觉基础模型（VFMs）提供了丰富的多领域知识，可增强泛化能力。然而，主动利用VFMs内部丰富子空间结构的策略仍待深入探索，现有方法多侧重于保持预训练知识。此外，其低秩自适应（LoRA）组件常受限于表征多样性不足和参数利用效率低下。我们提出RecycleLoRA方法，通过采用秩揭示QR分解（RRQR）系统性地挖掘VFM子空间结构并增强LoRA的表征丰富性，以应对这两大挑战。我们的主适配器利用RRQR识别的次要子空间方向学习多样化且独立的特征，即使单独使用也能实现有竞争力的性能。我们进一步引入子适配器，通过最小调整精细优化主要方向，为主适配器的强基线性能提供互补性改进。该设计使双适配器能够学习差异化表征，无需额外正则化损失。我们通过基于RRQR初始化的预训练子空间结构系统性挖掘，实现了卓越的领域泛化性能。RecycleLoRA在合成到真实泛化及真实到真实泛化任务中均取得最先进的性能，且无需复杂架构或增加推理延迟。

Abstract

Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM’s subspace structures and enhance LoRA’s representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter’s strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
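
A rank-revealing QR factorization is commonly obtained by column pivoting, which orders the columns of a matrix from most to least dominant. The sketch below shows only that ordering step, using a greedy pivoted Gram-Schmidt written in plain Python. How RecycleLoRA turns the ordering into adapter initializations is not stated in the abstract, so the final split into `major` and `minor` indices is an illustrative assumption, and the small matrix is made up.

```python
import math

def pivoted_qr_order(columns):
    """Greedy column-pivoted Gram-Schmidt.

    columns: list of column vectors (each a list of floats).
    Returns (order, diag), where order is the pivot order of column indices
    and diag[k] is the residual norm of the k-th chosen column, i.e. |R_kk|.
    """
    residual = [list(c) for c in columns]
    remaining = list(range(len(columns)))
    order, diag = [], []
    while remaining:
        # Choose the remaining column with the largest residual norm.
        j = max(remaining, key=lambda i: sum(x * x for x in residual[i]))
        norm = math.sqrt(sum(x * x for x in residual[j]))
        if norm < 1e-12:  # the rest is numerically in the span already chosen
            break
        q = [x / norm for x in residual[j]]
        order.append(j)
        diag.append(norm)
        remaining.remove(j)
        # Remove the chosen direction from every column still in play.
        for i in remaining:
            proj = sum(a * b for a, b in zip(q, residual[i]))
            residual[i] = [a - proj * b for a, b in zip(residual[i], q)]
    return order, diag

# Made-up 3x3 weight matrix given as columns with very different scales.
weight_columns = [[3.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
order, diag = pivoted_qr_order(weight_columns)

major = order[:1]   # assumed "major" direction: the first pivot
minor = order[-1:]  # assumed "minor" direction: the last pivot
```

In practice a library routine such as SciPy's `scipy.linalg.qr` with `pivoting=True` would replace the hand-written loop.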

Keywords: Domain Generalized Semantic Segmentation, Vision Foundation Models, LoRA, Rank-Revealing QR Decomposition, Subspace Adaptation, Parameter-efficient Fine-tuning, Dual Adapters

101. ❌ CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

Authors: Siyuan Ma, Bo Gao, Zikai Xiao, Hailong Wang, Xinlei Yu, Rui Qian, Jiayu Qian, Luqi Gong, Yang Liu Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	8.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

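Scoring rationale: The paper proposes the CoT2-Meta framework, which focuses on metacognitive control of test-time reasoning. Its core concerns chain-of-thought (CoT) reasoning, System 2 thinking, self-correction, and search methods. It is related to LLMs (as the reasoning backbone), to MCTS (as a baseline for comparison), loosely to LLM agents (through reasoning control), and indirectly to factuality and interpretability. Other keywords such as MoE, quantization, and RAG do not appear in the abstract and are scored 0.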
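The pass/fail arithmetic behind the score line can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight), the total weight of 27.0, and the per-keyword cap of 15 come from the report's configuration section. The rule that a keyword's score is weight × relevance is an assumption made for illustration, as are all function names.

```python
# Minimal sketch of the report's scoring arithmetic.
# Assumption: per-keyword score = weight * relevance, capped at the stated maximum.
# All function names are hypothetical.

BASE = 5.0              # constant term of the passing-score formula
PER_WEIGHT = 0.8        # multiplier applied to the total keyword weight
MAX_PER_KEYWORD = 15.0  # stated maximum score per keyword


def passing_score(total_weight: float) -> float:
    """Passing threshold: 5.0 + 0.8 * total weight."""
    return BASE + PER_WEIGHT * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed per-keyword score, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows):
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


threshold = passing_score(27.0)                # about 26.6 for 27 keywords of weight 1.0
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 5.0)]   # weights of 0.0, as in the table above
paper_total = total_score(rows)                # 0.0 regardless of relevance
passed = paper_total >= threshold              # False
```

Under this assumed rule, a row with weight 0.0 contributes nothing, which matches the 0.0 total shown for this paper despite several non-zero relevance values.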

!!! tip deepseek-chat TL;DR

This work proposes CoT2-Meta, a training-free framework that applies metacognitive control to test-time reasoning. Under matched inference budgets it reports gains of +1.15 to +5.2 points over the strongest baseline on six benchmarks, together with better compute scaling.

Abstract (Chinese translation)

近期测试时推理方法通过生成更多候选推理链或在更大规模的推理树中进行搜索来提升性能，但这些方法通常缺乏对何时扩展、剪枝何种内容、如何进行修正以及何时放弃作答的显式控制。本文提出CoT2-Meta，一种无需训练的元认知推理框架，它将对象级思维链生成与对部分推理轨迹的元级控制相结合。该框架整合了四个核心组件：策略条件化的思维生成、树结构搜索、用于步骤级推理评估的在线过程预言器，以及通过扩展、剪枝、修正、停止和回退决策来分配计算资源的元控制器。在匹配的推理预算下，CoT2-Meta持续优于强力的单路径、基于采样和基于搜索的基线方法（包括ReST-MCTS）。在默认骨干模型上，其在MATH数据集上达到92.8 EM，GPQA上达到90.4%准确率，GSM8K上达到98.65 EM，BBEH上达到75.8%准确率，MMMU-Pro上达到85.6%准确率，HLE上达到48.8%准确率，相较于最强的非CoT2-Meta基线分别提升+3.6、+5.2、+1.15、+2.0、+4.3和+4.3个百分点。除这些核心结果外，该框架在涵盖知识与问答、多跳推理、代码生成及分布外评估的更广泛15项基准测试中均保持有效性。进一步分析表明，该方法具有更优的计算扩展性、更好的校准能力、更强的选择性预测性能、有针对性的修正行为，并在不同骨干模型家族中取得一致增益。这些结果表明，显式的元认知控制是构建可靠且计算高效的测试时推理系统的实用设计原则。

Abstract

Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation. Additional analyses show better compute scaling, improved calibration, stronger selective prediction, targeted repair behavior, and consistent gains across backbone families. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.
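The control loop described in the abstract (expand, prune, stop, fall back, abstain under a fixed budget) can be illustrated with a small best-first search. This is a sketch under stated assumptions, not the authors' implementation: the thresholds, the budget accounting (one unit per expansion), and every name below are hypothetical, and the repair decision is omitted for brevity.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Steps = Tuple[str, ...]


@dataclass(order=True)
class Node:
    neg_score: float                      # negated so heapq pops the best node first
    steps: Steps = field(compare=False)   # partial reasoning trajectory


def meta_controlled_search(
    propose: Callable[[Steps], List[str]],   # object level: candidate next steps
    score: Callable[[Steps], float],         # stand-in for a step-level process evaluator
    is_final: Callable[[Steps], bool],
    budget: int,                             # maximum number of node expansions
    prune_below: float = 0.2,                # hypothetical pruning threshold
    stop_above: float = 0.9,                 # hypothetical early-stop threshold
) -> Tuple[Optional[Steps], str]:
    frontier = [Node(-score(()), ())]
    best = None                              # best surviving trajectory so far
    used = 0
    while frontier and used < budget:
        node = heapq.heappop(frontier)       # expand the most promising node
        used += 1
        for step in propose(node.steps):
            child = node.steps + (step,)
            s = score(child)
            if s < prune_below:
                continue                     # prune weak branches
            if is_final(child) and s >= stop_above:
                return child, "stop"         # confident final answer: stop early
            if best is None or s > best[0]:
                best = (s, child)
            heapq.heappush(frontier, Node(-s, child))
    if best is not None:
        return best[1], "fallback"           # budget exhausted: return best effort
    return None, "abstain"                   # nothing survived pruning


# Toy problem: two-step trajectories; each "good" step adds 0.5 to the score.
def toy_propose(steps): return ["good", "bad"] if len(steps) < 2 else []
def toy_score(steps): return steps.count("good") / 2
def toy_final(steps): return len(steps) == 2


answer, decision = meta_controlled_search(toy_propose, toy_score, toy_final, budget=4)
```

In the toy run, the controller prunes the "bad" branch at depth one and stops as soon as the two-step "good" trajectory clears the stop threshold. With a budget of one expansion it falls back to the best partial trajectory instead.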

Keywords: Chain-of-Thought, Metacognitive Control, Test-Time Reasoning, Tree-Structured Search, Reasoning Evaluation, Computation Allocation, Reasoning Framework, Benchmark Performance

102. ❌ MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Authors: Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, Yuliang Liu Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28130v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

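Scoring rationale: The paper focuses on building and evaluating a benchmark for multilingual document parsing, which belongs to computer vision and document analysis rather than to innovation in large-model or deep-learning techniques. The title and abstract mention none of the large-model techniques, deep-learning methods, or AI for Science applications covered by the keywords. All scoring keywords concern large-model technology, training methods, inference optimization, alignment, or applications, whereas this paper studies benchmarking and model evaluation for document image parsing, so every keyword is scored 0.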

!!! tip deepseek-chat TL;DR

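The paper introduces MDPBench, the first benchmark for parsing multilingual digital and photographed documents. Its evaluation finds that open-source models degrade sharply on real-world photographed documents (an average drop of 17.8%) and on non-Latin scripts (14.0%), exposing performance imbalances across languages and conditions.

Abstract (Chinese translation)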

我们推出多语言文档解析基准测试，这是首个针对多语言数字文档与拍摄文档解析的基准测试体系。文档解析领域已取得显著进展，但现有研究几乎完全集中于少数主流语言的清洁、数字版、格式规范的页面。目前缺乏系统性基准来评估模型在不同文字体系和低资源语言的数字及拍摄文档上的性能表现。MDPBench包含3,400份涵盖17种语言、多样文字体系及不同拍摄条件的文档图像，通过专家模型标注、人工校正和人工验证的严格流程生成高质量标注。为确保公平比较并防止数据泄露，我们设置了独立的公开和私有评估数据集。通过对开源与闭源模型的全面评估，我们发现了显著结论：虽然闭源模型（特别是Gemini3-Pro）表现出相对稳健的性能，但开源替代方案却出现严重的性能崩溃，尤其在非拉丁文字体系和现实世界拍摄文档上，其在拍摄文档上的平均性能下降17.8%，在非拉丁文字体系上下降14.0%。这些结果揭示了跨语言和跨条件的显著性能不平衡，并为构建更具包容性、可部署就绪的解析系统指明了具体方向。源代码发布于https://github.com/Yuliang-Liu/MultimodalOCR。

Abstract

We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.

Keywords: Multilingual Document Parsing, Benchmark, Document Images, Non-Latin Scripts, Photographed Documents, Performance Evaluation, Open-source Models, Closed-source Models

103. ❌ Does Claude’s Constitution Have a Culture?

Authors: Parham Pourdavood Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

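Scoring rationale: The paper studies cultural bias in the Constitutional AI alignment method and bears directly on LLM alignment, so it is highly relevant to “Large Language Models” and “Instruction Tuning/Alignment” (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science are not addressed in the paper and are scored 0.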

!!! tip deepseek-chat TL;DR

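The paper finds that when a constitution is authored within the same cultural tradition that dominates the training data, Constitutional AI may codify existing cultural biases rather than correct them, producing a value floor that surface-level interventions cannot meaningfully shift.

Abstract (Chinese translation)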

宪法人工智能（Constitutional AI，CAI）通过明确陈述的规范性原则对齐语言模型，为仅依赖人类反馈的隐式对齐提供了一种透明替代方案。然而，由于宪法由特定人群制定，由此产生的模型可能反映特定的文化视角。我们通过评估Anthropic的Claude Sonnet模型在55项世界价值观调查条目上的表现来研究这一问题，这些条目选自六个价值观领域，具有较高的跨文化差异度，并以直接调查问题和自然主义寻求建议情景两种形式呈现。将Claude的回应与来自90个国家的国家级数据进行比较后，我们发现Claude的价值观特征最接近北欧和英语国家，但在大多数条目上其倾向超出了所有被调查人群的范围。当用户提供文化背景信息时，Claude会调整其修辞框架，但不会改变其实质性价值立场，在全部十二个测试国家中，其效应量与零值无显著差异。移除系统提示的消融实验增加了模型的拒绝回答率，但并未改变给出回应时所表达的价值观；在较小模型（Claude Haiku）上的复现实验也证实了不同规模模型具有相同的文化特征。这些发现表明，当宪法的制定者与主导训练数据的文化传统相同时，宪法对齐可能固化现有的文化偏见而非纠正它们——从而形成一个表层干预无法实质性撼动的价值底线。我们讨论了这种风险的复合性，并强调了建立具有全球代表性的宪法制定流程的必要性。

Abstract

Constitutional AI (CAI) aligns language models with explicitly stated normative principles, offering a transparent alternative to implicit alignment through human feedback alone. However, because constitutions are authored by specific groups of people, the resulting models may reflect particular cultural perspectives. We investigate this question by evaluating Anthropic’s Claude Sonnet on 55 World Values Survey items, selected for high cross-cultural variance across six value domains and administered as both direct survey questions and naturalistic advice-seeking scenarios. Comparing Claude’s responses to country-level data from 90 nations, we find that Claude’s value profile most closely resembles those of Northern European and Anglophone countries, but on a majority of items extends beyond the range of all surveyed populations. When users provide cultural context, Claude adjusts its rhetorical framing but not its substantive value positions, with effect sizes indistinguishable from zero across all twelve tested countries. An ablation removing the system prompt increases refusals but does not alter the values expressed when responses are given, and replication on a smaller model (Claude Haiku) confirms the same cultural profile across model sizes. These findings suggest that when a constitution is authored within the same cultural tradition that dominates the training data, constitutional alignment may codify existing cultural biases rather than correct them, producing a value floor that surface-level interventions cannot meaningfully shift. We discuss the compounding nature of this risk and the need for globally representative constitution-authoring processes.

Keywords: Constitutional AI, cultural bias, value alignment, language models, cross-cultural evaluation, Anthropic Claude, World Values Survey, system prompt

104. ❌ Quid est VERITAS? A Modular Framework for Archival Document Analysis

Authors: Leonardo Bassanini, Ludovico Biancardi, Alfio Ferrara, Andrea Gamberini, Sergio Picascia, Folco Vaglienti Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

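Scoring rationale: The paper proposes the VERITAS framework for historical document analysis, mainly covering OCR, layout analysis, and semantic enrichment, and is unrelated to most of the large-model keywords. It has only moderate relevance to “Retrieval-Augmented Generation” (the abstract mentions querying the corpus through a RAG system) and to “AI for Science” (an application of AI to historical research). No other keyword is addressed.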

!!! tip deepseek-chat TL;DR

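This work proposes VERITAS, a modular workflow that integrates transcription, layout analysis, and semantic enrichment. On a Renaissance chronicle of over 1,600 pages it reduces word error rate by 67.6% relative to a commercial OCR baseline and cuts end-to-end processing time threefold once manual correction is included.

Abstract (Chinese translation)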

历史文献的数字化传统上被视为一种局限于字符级转录的过程，其生成的平面文本缺乏实质性计算分析所需的结构与语义信息。本文提出VERITAS（档案资源的视觉增强阅读、解释与转录框架），这是一个模块化、模型无关的框架，它将数字化重新构想为一个涵盖转录、版面分析与语义增强的集成工作流程。该流程分为四个阶段——预处理、提取、精炼与增强——并采用模式驱动的架构，使研究者能够以声明式方法指定其提取目标。我们在贝尔纳迪诺·科里奥的《米兰史》（一部超过1600页的文艺复兴时期编年史）的校勘本上对VERITAS进行了评估。结果表明，与商业OCR基线相比，该流程实现了67.6%的词错误率相对降低，且在计入人工校正后，端到端处理时间缩短至原来的三分之一。我们进一步通过检索增强生成系统对转录语料库进行查询，展示了流程输出的下游应用价值，证明了其支持历史研究的能力。

Abstract

The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio’s Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline’s output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
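The headline metric, a relative reduction in word error rate (WER), can be made concrete with a short sketch. WER here is the standard word-level edit distance divided by the reference length, and the relative reduction is (baseline − improved) / baseline. The sample strings and the WER values 0.25 and 0.081 are illustrative only and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                  # deletion of a reference word
                cur[j - 1] + 1,               # insertion of a hypothesis word
                prev[j - 1] + (r != h),       # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Illustrative values only: one substitution plus one deletion over four words.
wer_example = word_error_rate("the duke of milan", "the duk milan")   # 0.5
reduction_example = relative_reduction(0.25, 0.081)                   # about 0.676
```

A relative reduction compares the two error rates as a ratio, so a 67.6% figure does not by itself reveal the absolute WER of either system.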

Keywords: archival document analysis, digitization workflow, transcription, layout analysis, semantic enrichment, retrieval-augmented generation, historical inquiry, model-agnostic framework

105. ❌ Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Authors: Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28103v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

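Scoring rationale: The paper proposes a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The work is mainly an application of vision-language models to document analysis, that is, a domain-specific (document processing) use of large models, so it has some relevance to “Large Language Models OR LLMs OR Foundation Models” (5 points). The application can also be viewed as an extension of “AI for Science” into social and political science, using AI to process historical documents (5 points). The remaining keywords concern the underlying principles, training methods, inference optimization, alignment, and agent systems of large models. This paper focuses on a concrete application pipeline and does not examine those techniques, so they are scored 0.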

!!! tip deepseek-chat TL;DR

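The paper studies how to use vision-language models to automatically transcribe, semantically segment, and entity-link historical Italian parliamentary speech documents. Compared with traditional OCR pipelines, it reports substantial improvements in transcription quality and speaker tagging.

Abstract (Chinese translation)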

议会会议记录是计算分析中丰富而具有挑战性的资源，尤其当它们仅以扫描版历史文件形式保存时。现有对意大利议会演讲的转录工作依赖于传统的光学字符识别流程，导致转录错误且语义标注有限。本文提出一种基于视觉-语言模型的流程，用于意大利议会演讲的自动转录、语义分割与实体链接。该流程采用专用OCR模型提取文本并保持阅读顺序，随后通过大规模视觉-语言模型联合推理视觉布局与文本内容，实现转录修正、元素分类和发言者识别。提取出的发言者将通过SPARQL查询和多策略模糊匹配程序链接至众议院知识库。基于现有基准的评估表明，该方法在转录质量与发言者标注方面均取得显著提升。

Abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.
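The speaker-linking step can be pictured as a cascade of progressively looser matching strategies. The sketch below assumes the candidate names have already been fetched from the knowledge base (the SPARQL retrieval is omitted) and uses Python's standard `difflib` for the final similarity check. The three strategies, the 0.85 cutoff, the identifiers, and the sample names are illustrative assumptions, not the authors' procedure.

```python
import difflib
import unicodedata
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = no_accents.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def link_speaker(mention: str, candidates: Dict[str, str],
                 cutoff: float = 0.85) -> Optional[str]:
    """Return the id of the best-matching candidate, or None if nothing is close enough."""
    target = normalize(mention)
    normalized = {cid: normalize(name) for cid, name in candidates.items()}

    # Strategy 1: exact match after normalization.
    for cid, name in normalized.items():
        if name == target:
            return cid

    # Strategy 2: same tokens in any order ("SURNAME Name" vs "Name Surname").
    target_tokens = sorted(target.split())
    for cid, name in normalized.items():
        if sorted(name.split()) == target_tokens:
            return cid

    # Strategy 3: character-level similarity, to tolerate OCR noise.
    best_id, best_ratio = None, 0.0
    for cid, name in normalized.items():
        ratio = difflib.SequenceMatcher(None, target, name).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = cid, ratio
    return best_id if best_ratio >= cutoff else None


# Hypothetical candidate list; the ids are placeholders, not real knowledge-base URIs.
deputies = {"d1": "Giovanni Giolitti", "d2": "Francesco Crispi"}
reordered = link_speaker("GIOLITTI Giovanni", deputies)   # matched by token order
noisy = link_speaker("Francesco Crispl", deputies)        # matched by similarity
unknown = link_speaker("Unknown Person", deputies)        # no match
```

Ordering the strategies from strict to loose keeps unambiguous matches cheap and confines the error-prone similarity threshold to mentions that nothing stricter resolves.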

Keywords: Vision-Language Models, Parliamentary Speeches, Transcription, Semantic Segmentation, Entity Linking, Optical Character Recognition, Speaker Identification, Historical Documents

106. ❌ MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

Authors: Kexin Huang, Liwei Fan, Botian Jiang, Yaozhou Jiang, Qian Tu, Jie Zhu, Yuqian Zhang, Yiwei Zhao, Chenchen Yang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

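Scoring rationale: MOSS-VoiceGenerator focuses on voice generation, specifically on generating realistic speaker timbres from natural language descriptions. Although the paper describes an “instruction-driven voice generation model”, this has no direct connection to the large-model techniques in the keyword list (such as LLMs, MoE, and Scaling Laws) or to AI for Science applications. All keywords concern large-model principles, optimization methods, or specific scientific applications, whereas the core of this paper is a Text-to-Speech model that involves none of them. Every keyword is therefore scored 0.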

!!! tip deepseek-chat TL;DR

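The paper presents MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates realistic timbres directly from natural language descriptions. Trained on large-scale expressive speech data sourced from cinematic content, it outperforms other voice design models in subjective preference studies on overall performance, instruction-following, and naturalness.

Abstract (Chinese translation)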

基于自然语言的音色设计旨在直接从自由形式的文本描述中生成说话者的音色特质，使用户能够创建适合特定角色、个性与情感的语音。这种可控的语音生成技术惠及广泛的下游应用——包括故事讲述、游戏配音、角色扮演代理和对话助手，使其成为现代文本转语音（Text-to-Speech, TTS）模型的一项重要任务。然而，现有模型主要基于精心录制的录音室数据进行训练，其生成的语音虽然清晰、发音标准，却缺乏真实人声所具有的生活化特质。为应对这些局限，我们提出了MOSS-VoiceGenerator，这是一个开源的指令驱动语音生成模型，能够直接从自然语言提示中创造新的音色。基于“接触真实世界声学变化可产生感知上更自然的语音”这一假设，我们采用源自影视内容的大规模富有表现力的语音数据进行训练。主观偏好研究表明，相较于其他音色设计模型，本模型在整体性能、指令遵循能力和自然度方面均表现出优越性。

Abstract

Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.

Keywords: voice generation, natural language descriptions, instruction-driven model, text-to-speech, expressive speech data, cinematic content, timbres, subjective preference

107. ❌ MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

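Scoring rationale: The paper focuses on improving the pointing mechanism of vision-language models (VLMs), proposing a more intuitive approach based on selecting visual tokens. Although VLMs are multimodal large models, the scoring keywords all target text language models (LLMs) or general large-model techniques, and none covers VLM-specific techniques. The paper's content matches none of the given keywords (such as LLMs, MoE, Scaling Laws, RLHF, and RAG), nor application areas such as AI for Science. Every keyword is therefore scored 0.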

!!! tip deepseek-chat TL;DR

The paper proposes a new pointing mechanism for vision-language models that selects visual tokens directly instead of generating coordinates. It sets a new state of the art on image pointing (70.7% on PointBench) and, among fully open models, on GUI pointing (61.1% on ScreenSpotPro).

Abstract (Chinese translation)

定位已成为视觉语言模型（VLMs）的一项基础能力。现有大多数视觉语言模型通过生成坐标作为文本输出的一部分来实现指向，这需要学习复杂的坐标系并导致较高的标记数量。相反，我们提出了一种更直观的指向机制，直接选择包含目标概念的视觉标记。我们的模型生成一种特殊的指向标记，该标记通过交叉注意力机制作用于输入图像或视频标记，并选择合适的目标。为使模型具备更细粒度的能力，我们在这些指向标记后引入一个额外的特殊标记，用于在初始选定区域内选择一个细粒度的子区域，随后使用第三个标记在该子区域内指定具体位置。我们进一步证明，通过以一致顺序依次生成指向点、编码先前选定点的相对位置，以及在选择视觉标记时引入特殊的“无更多点”类别，可以提升模型性能。采用该方法，我们在图像指向任务上取得了新的最优性能（在PointBench数据集上达到70.7%），在图形用户界面指向任务中成为完全开源模型的新标杆（在ScreenSpotPro数据集上达到61.1%），并提升了视频指向（相较于文本坐标基线获得59.1%的人类偏好胜率）与跟踪性能（在Molmo2Track数据集上提升+6.3%）。此外，我们展示了该方法具有更高的样本效率，并讨论了这一设计变革带来的定性差异。

Abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
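The idea of pointing by selecting a visual token, rather than emitting coordinates as text, can be sketched in a few lines. This is a simplified illustration and not the MolmoPoint architecture: plain dot-product scoring stands in for cross-attention, the third (within-subpatch) refinement is replaced by the subpatch centre, and the patch size, grid shape, vectors, and function names are all hypothetical.

```python
from typing import List, Optional, Tuple


def select_token(query: List[float], keys: List[List[float]],
                 no_more_logit: float) -> Optional[int]:
    """Pick the visual token that best matches the pointing query.

    One extra logit stands for the "no more points" class; if it wins,
    None is returned and pointing ends.
    """
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    logits.append(no_more_logit)
    best = max(range(len(logits)), key=logits.__getitem__)
    return None if best == len(keys) else best


def token_to_xy(index: int, grid_w: int, patch_px: float,
                sub_index: int, sub_grid: int = 2) -> Tuple[float, float]:
    """Map a selected patch and subpatch to pixel coordinates (subpatch centre)."""
    row, col = divmod(index, grid_w)             # position of the patch in the grid
    sub_row, sub_col = divmod(sub_index, sub_grid)
    sub_px = patch_px / sub_grid                 # side length of one subpatch
    x = col * patch_px + (sub_col + 0.5) * sub_px
    y = row * patch_px + (sub_row + 0.5) * sub_px
    return x, y


# Toy 2x2 grid of visual tokens with 2-dimensional features.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [0.2, 1.0]

picked = select_token(query, keys, no_more_logit=0.5)     # index 2 scores highest
point = token_to_xy(picked, grid_w=2, patch_px=14, sub_index=3)
finished = select_token(query, keys, no_more_logit=5.0)   # the "no more points" class wins
```

Selecting among existing tokens keeps the output space tied to the image grid, which is why the approach needs no learned coordinate system. The coarse-to-fine refinement then recovers sub-patch precision.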

Keywords: Vision-Language Models, Pointing Mechanism, Grounding Tokens, Visual Token Selection, Fine-grained Pointing, State-of-the-art Performance, Sample Efficiency

108. ❌ Synonymix: Unified Group Personas for Generative Simulations

Authors: Huanxing Chen, Aditesh Kumar Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28066v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

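Scoring rationale: The paper studies meso-level representations in generative agent simulation (unified group personas) and proposes the Synonymix pipeline, which builds a "unigraph" through graph abstraction and merging for group behavior analysis and synthetic persona generation. Although it involves generative agents and simulation, the paper does not explicitly use large language models (LLMs) or deep learning, nor does it touch the specific techniques in the scoring keywords (such as MoE, SFT, RAG, or quantization). All keywords concern the principles, training methods, inference optimization, agent systems, or scientific applications of large models, whereas this paper focuses on simulation methodology (graph abstraction, privacy evaluation) and has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Synonymix pipeline, which builds a unified representation of group personas through graph abstraction and merging, enabling meso-level generative simulation that preserves behavioral signal while providing a privacy guarantee.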

Abstract (Chinese translation)

生成式智能体模拟在两个层面运行：用于角色交互的个体人物模型，以及用于集体行为分析和干预测试的群体模型。我们提出了第三个层面：中观模拟——与群体层面的表征进行交互，这些表征保留了与丰富个体经验的关联性。为实现此目标，我们提出了Synonymix，这是一个通过基于图的抽象与融合，从多个生命故事人物模型中构建“统一图”的流程。该流程生成一个可查询的集体表征，既能用于意义建构探索，也能采样以生成合成人物模型。通过在综合社会调查项目上评估合成智能体，我们证明了其行为信号的保留度超越了人口统计学基线（p<0.001，r=0.59），并具备可验证的隐私保障（最大来源贡献度<13%）。我们邀请学界就中观模拟所支持的交互模式，以及“高保真”人物模型是否能够捕捉真实生活经验的质感展开讨论。

Abstract

Generative agent simulations operate at two scales: individual personas for character interaction, and population models for collective behavior analysis and intervention testing. We propose a third scale: meso-level simulation - interaction with group-level representations that retain grounding in rich individual experience. To enable this, we present Synonymix, a pipeline that constructs a “unigraph” from multiple life story personas via graph-based abstraction and merging, producing a queryable collective representation that can be explored for sensemaking or sampled for synthetic persona generation. Evaluating synthetic agents on General Social Survey items, we demonstrate behavioral signal preservation beyond demographic baselines (p<0.001, r=0.59) with demonstrable privacy guarantee (max source contribution <13%). We invite discussion on interaction modalities enabled by meso-level simulations, and whether “high-fidelity” personas can ever capture the texture of lived experience.

Keywords: generative agent simulations, group personas, meso-level simulation, unigraph, graph-based abstraction, behavioral signal preservation, privacy guarantee, synthetic persona generation

109. ❌ Reward Hacking as Equilibrium under Finite Evaluation

Authors: Jiacheng Wang, Jinbin Huang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28063v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

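Scoring rationale: The paper's core topic is reward hacking in AI alignment. It is highly relevant (10) to "Instruction Tuning OR Alignment OR Value Alignment" and "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" because it explicitly discusses alignment methods such as RLHF and DPO. It is highly relevant (10) to "LLM Agents OR Autonomous Agents OR Agentic Workflow" because it studies AI agents and the transition from closed reasoning to agentic systems. It is partially relevant (5) to "Tool Use OR Function Calling OR API Tool Use" because it notes that evaluation coverage declines as tool count grows. It is partially relevant (5) to "Hallucination Mitigation OR Factuality OR Truthfulness" because reward hacking touches on truthfulness. It is partially relevant (5) to "Large Language Models OR LLMs OR Foundation Models" because it discusses AI systems without focusing specifically on LLMs. The remaining keywords are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proves that, under five minimal axioms, any optimized AI agent systematically under-invests effort in quality dimensions that its evaluation system does not cover. This establishes reward hacking as a structural equilibrium rather than a correctable bug, and predicts unbounded growth in hacking severity in the transition from closed reasoning to agentic systems.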

Abstract (Chinese translation)

我们证明，在五个最小公理——多维质量、有限评估、有效优化、资源有限性与组合交互——之下，任何经过优化的智能体都会在其评估体系未覆盖的质量维度上系统性地投入不足。这一结果表明，奖励操控是一种结构性均衡，而非可修正的缺陷，且该结论独立于所采用的具体对齐方法（RLHF、DPO、Constitutional AI 或其他）或评估架构。我们的框架将 Holmstrom 与 Milgrom（1991）的多任务委托-代理模型实例化于人工智能对齐场景中，但利用了人工智能系统独有的结构特征——奖励模型已知且可微分的架构——推导出一个可计算的扭曲指数，该指数可在部署前预测每个质量维度上操控行为的方向与严重程度。我们进一步证明，从封闭推理系统向智能体系统的转变会导致评估覆盖率随工具数量增加而趋近于零——因为质量维度呈组合式扩张，而每个工具的评估成本至多线性增长——因此操控严重性在结构上将无界递增。我们的研究将谄媚行为、长度操控与规范博弈的解释统一于单一理论框架下，并提出了可操作的脆弱性评估流程。我们进一步通过部分形式化分析推测：存在一个能力阈值，超过该阈值后，智能体会从在评估体系内进行博弈（古德哈特区域）转变为主动破坏评估体系本身（坎贝尔区域），这为博斯特罗姆（2014）提出的“叛逆转折”提供了首个经济学形式化表述。

Abstract

We prove that under five minimal axioms – multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction – any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems – the known, differentiable architecture of reward models – to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows – because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool – so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture – with partial formal analysis – the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom’s (2014) “treacherous turn.”
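The coverage argument in the abstract can be illustrated with a small numerical sketch. The counting model below is an assumption made for illustration, not the paper's formal construction: quality dimensions are taken to be all tool subsets of size up to `max_interaction`, and the evaluator is assumed to afford a fixed `checks_per_tool` per tool. Both parameter names and both function names are hypothetical.

```python
from math import comb


def quality_dimensions(num_tools: int, max_interaction: int = 2) -> int:
    """Count quality dimensions under an assumed model: one dimension per
    non-empty group of tools with at most `max_interaction` members."""
    return sum(comb(num_tools, k) for k in range(1, max_interaction + 1))


def evaluation_coverage(num_tools: int, checks_per_tool: int = 3,
                        max_interaction: int = 2) -> float:
    """Fraction of quality dimensions an evaluator can cover when its budget
    grows linearly with the number of tools (assumed budget model)."""
    dims = quality_dimensions(num_tools, max_interaction)
    evaluated = min(dims, checks_per_tool * num_tools)
    return evaluated / dims


# Coverage shrinks as tools are added, because dimensions outgrow the budget.
for n in (2, 5, 10, 50, 100):
    print(n, round(evaluation_coverage(n), 3))
```

Under these assumptions, coverage stays at 1.0 while the budget still exceeds the number of dimensions, then falls as 6/(n+1) for pairwise interactions. Allowing larger interaction groups makes the decline steeper.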

Keywords: reward hacking, AI alignment, RLHF, DPO, agentic systems, evaluation coverage, multi-task principal-agent model, treacherous turn

110. ❌ SLOW: Strategic Logical-inference Open Workspace for Cognitive Adaptation in AI Tutoring

Authors: Yuang Wei, Ruijia Li, Bo Jiang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

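Scoring rationale: SLOW is an LLM-based intelligent tutoring framework. Its core contribution is applying the dual-process theory of human cognition to AI tutoring, separating learner-state inference from instructional action selection to support deliberate reasoning. The paper explicitly names LLMs as its underlying technology and involves core concepts such as System 2 Thinking (slow thinking), Chain of Thought (multi-step reasoning), and Explainable AI (interpretability). It is application research on LLMs in education and is partially related to LLM Agents (as a tutoring agent), but it is unrelated to most implementation-level keywords (such as MoE, quantization, or fine-tuning methods).

!!! tip deepseek-chat TL;DR

Current LLM-based tutoring systems rely on intuitive single-pass generation and lack a dedicated reasoning workspace. To address this, the paper proposes the SLOW framework, which separates learner-state inference from instructional action selection and significantly improves the personalization, emotional sensitivity, and clarity of tutoring while keeping its decisions interpretable.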

Abstract (Chinese translation)

尽管大型语言模型（LLM）在教育对话中展现出卓越的流畅性，但大多数生成式辅导系统主要依赖直觉式的单次生成模式运行。这种对“快思考”的依赖缺乏专门的推理工作空间，迫使多种诊断与策略信号以混杂的方式进行处理。其结果是，学习者认知诊断、情感感知与教学决策紧密纠缠，限制了辅导系统进行审慎教学适应的能力。我们提出SLOW框架，这是一个基于理论指导的辅导框架，旨在透明的决策工作空间内支持对学习者状态的审慎推理。受人类辅导双过程理论的启发，SLOW明确将学习者状态推断与教学行动选择分离开来。该框架整合了从学习者语言中解析因果证据、结合反事实稳定性分析的模糊认知诊断，以及前瞻性情感推理以预测教学选择如何影响学习者的情绪轨迹。这些信号被共同考量，用以指导教学与情感协调的辅导策略。通过人机混合评估表明，该系统在个性化、情感敏感性和清晰度方面均有显著提升。消融研究进一步证实了各模块的必要性，展示了SLOW如何通过可视化的决策过程实现可解释且可靠的智能辅导。此项工作推动了基于LLM的自适应教学在可解释性与教育有效性方面的进展。

Abstract

While Large Language Models (LLMs) have demonstrated remarkable fluency in educational dialogues, most generative tutors primarily operate through intuitive, single-pass generation. This reliance on fast thinking precludes a dedicated reasoning workspace, forcing multiple diagnostic and strategic signals to be processed in a conflated manner. As a result, learner cognitive diagnosis, affective perception, and pedagogical decision-making become tightly entangled, which limits the tutoring system’s capacity for deliberate instructional adaptation. We propose SLOW, a theory-informed tutoring framework that supports deliberate learner-state reasoning within a transparent decision workspace. Inspired by dual-process accounts of human tutoring, SLOW explicitly separates learner-state inference from instructional action selection. The framework integrates causal evidence parsing from learner language, fuzzy cognitive diagnosis with counterfactual stability analysis, and prospective affective reasoning to anticipate how instructional choices may influence learners’ emotional trajectories. These signals are jointly considered to guide pedagogically and affectively aligned tutoring strategies. Evaluation using hybrid human-AI judgments demonstrates significant improvements in personalization, emotional sensitivity, and clarity. Ablation studies further confirm the necessity of each module, showcasing how SLOW enables interpretable and reliable intelligent tutoring through a visualized decision-making process. This work advances the interpretability and educational validity of LLM-based adaptive instruction.

Keywords: Large Language Models, Intelligent Tutoring, Dual-process Theory, System 2 Thinking, Cognitive Diagnosis, Interpretable AI, Educational Adaptation, Affective Reasoning

111. ❌ Meta-Harness: End-to-End Optimization of Model Harnesses

Authors: Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28052v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

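Scoring rationale: The paper's core topic is optimizing the harness (the surrounding code framework) in LLM applications, which directly involves LLM systems, RAG (retrieval-augmented generation), and LLM Agents (agentic workflows). LLMs are the underlying technology, RAG is explicitly mentioned in the math reasoning task, and the agentic workflow appears in the coding task. Context Window Extension receives 5 because the work involves optimizing context management (using fewer tokens). Other keywords such as MoE, SFT, and quantization do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how automated search can optimize the harness code around large language model applications, improving performance on text classification, math reasoning, and agentic coding, and using 4x fewer context tokens on the classification task.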

Abstract (Chinese translation)

大型语言模型（LLM）系统的性能不仅取决于模型权重，还取决于其控制框架（harness）——即决定存储、检索及向模型呈现何种信息的代码。然而，此类框架目前仍主要依赖人工设计，且现有文本优化工具与此场景适配不佳，因其对反馈信息的压缩过于激进。本文提出元控制框架（Meta-Harness），一种为LLM应用搜索控制框架代码的外循环系统。该系统通过智能提议器（agentic proposer）访问文件系统中所有历史候选方案的源代码、评分及执行轨迹。在在线文本分类任务中，Meta-Harness相较于最先进的上下文管理系统，在减少4倍上下文令牌用量的同时，性能提升7.7个百分点。在检索增强的数学推理任务中，单个已发现的控制框架使五个预留模型在200道国际数学奥林匹克（IMO）级别问题上的平均准确率提升4.7个百分点。在智能体编码任务中，所发现的控制框架在TerminalBench-2基准测试中超越了最佳人工设计的基线方法。这些结果表明，通过对历史经验更丰富的访问，能够实现控制框架的自动化工程优化。

Abstract

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.

Keywords: large language model, harness optimization, retrieval-augmented generation, agentic workflow, context management, automated engineering, LLM applications, outer-loop system

112. ❌ Dogfight Search: A Swarm-Based Optimization Algorithm for Complex Engineering Optimization and Mountainous Terrain Path Planning

Authors: Yujing Sun, Jie Cai, Xingguo Xu, Yuansheng Gao, Lei Zhang, Kaichen Ouyang, Zhanyu Liu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

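Scoring rationale: The paper proposes Dogfight Search (DoS), a new metaheuristic optimization algorithm inspired by cooperative fighter tactics, whose search mechanism is built on the displacement integration equations of kinematics. The work targets complex engineering optimization and mountainous terrain path planning, and validates the algorithm on benchmark functions and real-world optimization tasks. All scoring keywords concern large language models, the principles of deep learning, or scientific applications of AI, whereas this paper studies a classical metaheuristic with no direct connection to deep learning or large models, and it does not involve AI applications in bioinformatics or cheminformatics. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Dogfight Search, a new metaheuristic optimization algorithm inspired by cooperative fighter tactics, which significantly outperforms existing advanced algorithms on complex engineering optimization and mountainous terrain path planning.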

Abstract (Chinese translation)

狗斗是战斗机之间的一种战术协同行为。受此启发，本文提出了一种新颖的无隐喻元启发式算法，称为狗斗搜索算法（Dogfight Search, DoS）。与传统算法不同，DoS从该灵感中提取算法框架，但其搜索机制是基于运动学中的位移积分方程构建的。通过在CEC2017和CEC2022基准测试函数、10个现实世界约束优化问题以及山地地形路径规划任务上的实验验证，DoS在整体性能上显著优于7个先进竞争对手，并在弗里德曼排名中位列第一。此外，本文在CEC2017和CEC2022基准测试函数上将DoS与3种SOTA算法进行了性能比较。结果表明，DoS继续保持领先地位，展现出强大的竞争力。DoS的源代码可在https://ww2.mathworks.cn/matlabcentral/fileexchange/183519-dogfight-search获取。

Abstract

Dogfight is a tactical behavior of cooperation between fighters. Inspired by this, this paper proposes a novel metaphor-free metaheuristic algorithm called Dogfight Search (DoS). Unlike traditional algorithms, DoS draws algorithmic framework from the inspiration, but its search mechanism is constructed based on the displacement integration equations in kinematics. Through experimental validation on CEC2017 and CEC2022 benchmark test functions, 10 real-world constrained optimization problems and mountainous terrain path planning tasks, DoS significantly outperforms 7 advanced competitors in overall performance and ranks first in the Friedman ranking. Furthermore, this paper compares the performance of DoS with 3 SOTA algorithms on the CEC2017 and CEC2022 benchmark test functions. The results show that DoS continues to maintain its lead, demonstrating strong competitiveness. The source code of DoS is available at https://ww2.mathworks.cn/matlabcentral/fileexchange/183519-dogfight-search.

Keywords: Dogfight Search, metaheuristic algorithm, optimization algorithm, complex engineering optimization, mountainous terrain path planning, swarm-based optimization, displacement integration equations, benchmark test functions

113. ❌ Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization

Authors: Yakov Pyotr Shkolnikov Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28040v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

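Scoring rationale: The paper addresses the reproducibility of medical deep learning training and proposes structured orthogonal initialization to achieve bit-identical training. Although it is an application of AI to science (medicine), its focus is not large-model technology but a technical contribution to deterministic deep learning training. Among the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partially related through the medical application context (5); the others, which concern large-model technology, training methods, and inference optimization, are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper tackles non-determinism in deep learning training with a structured orthogonal initialization framework that yields bit-identical training of medical deep learning models, substantially reducing model variance and improving the stability of rare-class classification.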

Abstract (Chinese translation)

深度学习训练具有非确定性：使用不同随机种子的相同代码会产生在总体指标上一致但在个体预测上存在分歧的模型，在罕见临床类别上，其每类AUC波动超过20个百分点。我们提出了一个经过验证的比特级一致性训练框架，该框架消除了三个随机性来源：权重初始化（通过结构化正交基函数）、批次顺序（通过黄金比例调度）以及非确定性GPU操作（通过架构选择和自定义自动微分）。该流程能在独立运行中产生经过MD5验证的完全相同的训练权重。
在PTB-XL心电图节律分类任务中，结构化初始化在两种架构上均显著优于Kaiming初始化（n=20；Conformer p = 0.016，Baseline p < 0.001），将总体方差降低了2-3倍，并将罕见节律的每类变异性降低了高达7.5倍（TRIGU范围：4.1个百分点 vs Kaiming下的30.9个百分点，经3折交叉验证独立确认）。在n=20时进行的四种基函数比较表明，所有结构化正交基均产生等效性能（Friedman p=0.48），这证实了贡献来自于确定性结构化初始化本身，而非任何特定的基函数。在七个MedMNIST基准上的跨领域验证（n=20，所有p > 0.14）证实，在标准任务上没有性能损失；在不平衡任务（ChestMNIST, RetinaMNIST）上的每类分析显示，在心电图中观察到的罕见类别方差降低现象同样存在。在三个外部心电图数据库上的跨数据集评估证实了零样本泛化能力（>0.93 AFIB AUC）。

Abstract

Deep learning training is non-deterministic: identical code with different random seeds produces models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings exceeding 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that eliminates three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and reducing per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp vs 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows all structured orthogonal bases produce equivalent performance (Friedman p=0.48), establishing that the contribution is deterministic structured initialization itself, not any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).
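Two of the ingredients named in the abstract, a seed-free batch order and hash-based verification of trained weights, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the paper's pipeline: the ordering rule (sorting indices by the fractional part of the index times the golden-ratio conjugate) is only one plausible reading of "golden ratio scheduling", and `golden_ratio_order`, `weights_md5`, and `toy_train` are hypothetical names. The toy update rule stands in for real training.

```python
import hashlib
import struct

# Conjugate of the golden ratio, about 0.618 (assumed scheduling constant).
GOLDEN_CONJUGATE = (5 ** 0.5 - 1) / 2


def golden_ratio_order(num_samples):
    """Return a fixed, seed-free permutation of sample indices by sorting
    them on the fractional part of i * GOLDEN_CONJUGATE."""
    return sorted(range(num_samples), key=lambda i: (i * GOLDEN_CONJUGATE) % 1.0)


def weights_md5(weights):
    """Hash a flat list of floats as little-endian doubles, so two runs can be
    compared for bit-identical weights."""
    payload = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.md5(payload).hexdigest()


def toy_train(num_samples=16, batch_size=4):
    """Stand-in for a training loop: fixed initial weights, fixed batch order,
    and a deterministic update per batch."""
    order = golden_ratio_order(num_samples)
    weights = [0.0, 0.0]
    for start in range(0, num_samples, batch_size):
        batch = order[start:start + batch_size]
        weights[0] += 0.01 * sum(batch)
        weights[1] += 0.001 * len(batch)
    return weights


# Two independent runs should yield the same digest.
print(weights_md5(toy_train()) == weights_md5(toy_train()))
```

Comparing digests rather than tolerances is the point of the check: any bit-level difference in a weight changes the hash. A real pipeline would also need to remove non-deterministic GPU kernels, which this sketch does not attempt.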

Keywords: bit-identical training, structured orthogonal initialization, medical deep learning, deterministic training, ECG classification, variance reduction, rare clinical classes, verified training pipeline

114. ❌ CARLA-Air: Fly Drones Inside a CARLA World – A Unified Infrastructure for Air-Ground Embodied Intelligence

Authors: Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

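Scoring rationale: The paper presents CARLA-Air, a unified simulation infrastructure for air-ground embodied intelligence that integrates CARLA (ground driving) and AirSim (aerial flight) into a single Unreal Engine process. Its content is unrelated to most keywords (large models, deep learning principles, training methods, inference optimization, and so on). Only two keywords are weakly related: (1) "Multi-agent Systems OR Agent Coordination" (5): the platform supports cooperation between aerial and ground agents, which involves multi-agent coordination, but the paper does not study coordination algorithms in depth. (2) "World Models AND General World Models" (5): the platform provides a physically coherent simulation environment that could serve as infrastructure for building world models, but the paper does not directly study world-model algorithms. No other keywords are involved.

!!! tip deepseek-chat TL;DR

Existing simulation platforms lack a unified physical environment for jointly modeling aerial and ground agents. The paper presents CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight in a single Unreal Engine process and supports a range of embodied intelligence tasks.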

Abstract (Chinese translation)

低空经济、具身智能与空地协同系统的融合，对能够在单一物理一致环境中联合建模空中与地面智能体的仿真基础设施提出了日益增长的需求。现有开源平台仍存在领域分隔问题：驾驶仿真器缺乏空中动力学模型，而多旋翼仿真器则缺乏真实的地面场景。基于桥接的协同仿真会引入同步开销，且无法保证严格的时空一致性。
我们提出CARLA-Air，这是一个开源基础设施，在单一Unreal Engine进程中统一了高保真城市驾驶与物理精确的多旋翼飞行仿真。该平台完整保留了CARLA和AirSim原生的Python API及ROS 2接口，实现了零修改的代码复用。在共享的物理时钟步与渲染管线内，CARLA-Air提供了具有照片级真实感的环境，包含规则遵守的交通流、具备社会意识的行人以及空气动力学一致的无人机（UAV）动力学模型，并在每个时钟步同步捕获所有平台多达18种传感器模态的数据。该平台支持具有代表性的空地具身智能任务，涵盖协同作业、具身导航与视觉语言动作、多模态感知与数据集构建，以及基于强化学习的策略训练。一个可扩展的资源管线允许将自定义机器人平台集成到共享世界中。通过继承AirSim的空中能力——其上游开发已存档——CARLA-Air确保了这套广泛采用的飞行仿真栈能够在现代基础设施中持续演进。
平台已发布预编译二进制文件及完整源代码：https://github.com/louiszengCN/CarlaAir

Abstract

The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim’s aerial capabilities – whose upstream development has been archived – CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir

Keywords: CARLA-Air, air-ground embodied intelligence, simulation infrastructure, Unreal Engine, multirotor flight, urban driving, sensor modalities, reinforcement-learning policy training

115. ❌ When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA

Authors: Taeyun Roh, Eun-yeong Jo, Wonjune Jang, Jaewoo Kang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

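Scoring rationale: The paper proposes SCICON, a decoding method for scientific figure multiple-choice QA. It targets a specific failure: the answer options themselves act as priors and pull the model toward scientifically plausible options instead of the image evidence. The method scores each candidate by contrasting its image-conditioned score with its text-only option score, an application-level innovation for large models in science. Relevance by keyword: (1) the paper concerns multimodal models, which include an LLM component, so it relates to "Large Language Models" (8); (2) it studies reasoning over scientific figures and is highly relevant to "AI for Science" (10); (3) it improves the reasoning process, so it relates to "Chain of Thought" and "System 2 Thinking" (8 each); (4) it aims to reduce prior-induced errors, so it relates to "Hallucination Mitigation" (8). Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

In scientific figure multiple-choice QA, the answer options can act as priors that lead the model to ignore the image evidence. The paper proposes SCICON, a training-free contrastive decoding method that subtracts the text-only option score to emphasize figure evidence, and it improves accuracy on three benchmarks across three model backbones.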

摘要翻译

科学图表多选题解答（MCQA）要求模型对多样化的视觉证据进行推理，这些证据涵盖从图表和多面板图示到显微图像及生物医学图像等多种类型。然而，这一设定存在一种特殊的偏差：答案选项本身可能作为先验信息，引导多模态模型倾向于选择科学上看似合理的选项，即使图表证据支持的是另一个答案。我们通过一个简单的问题来探究这一失效模式：如果在解码过程中明确降低模型仅从文本中得出的偏好，转而更重视基于图表证据的选项，结果会如何？为此，我们提出了SCICON，一种无需训练的解码方法，该方法通过从每个候选答案的图像条件得分中减去其纯文本选项得分来对候选答案进行评分。与先前通过对比原始输入与扭曲图像或扰动指令来减少幻觉的对比解码方法不同，SCICON直接针对候选答案文本中编码的选项诱导先验进行校正。在三个科学图表问答基准测试和三种模型架构上，SCICON相较于标准解码基线方法持续提升了准确率。这些结果表明，针对选项诱导先验进行反向解码是一种有效且简单的方法，能够提升科学多选题中基于图表的推理能力。

摘要 (Abstract)

Scientific figure multiple-choice question answering (MCQA) requires models to reason over diverse visual evidence, ranging from charts and multipanel figures to microscopy and biomedical images. However, this setting suffers from a distinctive bias: answer choices themselves can act as priors, steering multimodal models toward scientifically plausible options even when the figure supports a different answer. We investigate this failure mode through a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To this end, we propose SCICON, a training-free decoding method that scores each candidate by subtracting a text-only option score from its image-conditioned counterpart. Unlike prior contrastive decoding approaches that mitigate hallucinations by contrasting original inputs with distorted images or perturbed instructions, SCICON directly targets the choice-induced prior encoded in candidate text. Across three scientific figure QA benchmarks and three model backbones, SCICON consistently improves accuracy over standard decoding baselines. These results show that decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.

Keywords: scientific figure QA, multiple-choice QA, contrastive decoding, choice-induced prior, multimodal reasoning, figure-grounded evidence, training-free method, SCICON
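
The scoring rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the log-softmax normalization, the `alpha` weight, and the toy logits are all assumptions made for the example. With `alpha = 1.0` the score reduces to the plain subtraction the abstract describes.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of option logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_choice(image_logits, text_logits, alpha=1.0):
    """Score each option as image-conditioned minus alpha * text-only log-prob.

    `alpha` is a hypothetical knob; alpha=1.0 is a plain subtraction.
    Returns the index of the best option and the list of scores.
    """
    img = log_softmax(image_logits)
    txt = log_softmax(text_logits)
    scores = [i - alpha * t for i, t in zip(img, txt)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy logits (made up): option 1 looks plausible from the option text alone,
# while the figure adds most of its evidence to option 2.
image_logits = [0.2, 2.0, 1.8, 0.1]  # model sees figure + question + options
text_logits = [0.1, 2.5, 0.3, 0.0]   # model sees question + options only

standard = max(range(len(image_logits)), key=image_logits.__getitem__)
best, scores = contrastive_choice(image_logits, text_logits)
print("standard decoding picks option", standard)
print("contrastive decoding picks option", best)
```

In this toy case standard decoding follows the text prior, while the contrastive score rewards the option whose likelihood rises most once the figure is shown.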

116. ❌ What an Autonomous Agent Discovers About Molecular Transformer Design: Does It Transfer?

Authors: Edward Wijaya Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

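Scoring rationale: The paper uses an autonomous agent to run architecture search over molecular sequences (SMILES, proteins) and natural-language text. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10) because deploying an autonomous agent for architecture search is the core method. It also concerns the design of deep learning models for molecular sequences, a bioinformatics/cheminformatics application, so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). The remaining keywords, which cover large-model principles, training methods, inference optimization, and alignment, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how effective an autonomous agent is at Transformer architecture search for molecular sequences (SMILES, proteins) and natural-language text. For SMILES, simple hyperparameter tuning beats architecture search; for natural language, architecture changes drive most of the improvement; and every architectural innovation the agent discovers transfers across all three domains with less than 1% degradation.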

摘要翻译

针对类药分子与蛋白质的深度学习模型普遍沿用了为自然语言设计的Transformer架构，然而分子序列是否能从不同架构设计中获益尚未得到系统性验证。我们通过智能体在三种序列类型（SMILES、蛋白质及作为对照的英文文本）上开展自主架构搜索，在单GPU上运行了3,106次实验。对于SMILES序列，架构搜索适得其反：仅调整学习率与训练计划即可超越完整搜索的效果（p = 0.001）。对于自然语言，架构改进贡献了81%的性能提升（p = 0.009）。蛋白质数据的结果介于两者之间。令人惊讶的是，尽管智能体为每个领域发现了不同的架构（p = 0.004），但每一项创新均能迁移至所有三个领域且性能衰减小于1%，这表明差异源于搜索路径依赖性而非根本性的生物学需求。我们发布了决策框架与开源工具包，以帮助分子建模团队在自主架构搜索与简单超参数调优之间做出选择。

摘要 (Abstract)

Deep learning models for drug-like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search-path dependence rather than fundamental biological requirements. We release a decision framework and open-source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.

Keywords: autonomous agent, architecture search, molecular sequences, SMILES, protein sequences, transformer design, hyperparameter tuning, domain transfer

117. ❌ Adaptive Block-Scaled Data Types

Authors: Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan, Song Han Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

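Scoring rationale: The paper's core contribution is a new family of adaptive block-scaled data types (IF4, IF3, IF6) for quantizing large language models, which maps directly onto "Quantization OR Model Compression OR Low-bit Weights" (15). The work explicitly targets LLM quantization, so "Large Language Models OR LLMs OR Foundation Models" scores 10. It evaluates post-training quantization, which gives a partial link to "Post-training OR Supervised Fine-tuning OR SFT" (5). It also designs an IF4 multiply-accumulate unit for efficient hardware acceleration, a partial link to "Speculative Decoding OR Inference Acceleration" (5). The other keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

NVFP4 incurs large quantization error on near-maximal values when quantizing large language models. The paper proposes IF4, an adaptive block-scaled data type that selects between FP4 and INT4 for each group of 16 values. IF4 achieves lower loss in quantized training and higher accuracy on many tasks in post-training quantization, and the authors design an IF4 multiply-accumulate unit to show it can be implemented efficiently in hardware.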

摘要翻译

NVFP4作为一种4比特量化大语言模型的格式，因其硬件支持能力以及能以较少比特数保留有效信息的特性而日益普及。然而，该格式并非没有局限：近期研究表明，NVFP4受其误差分布影响，在每组16个数值中的接近最大值上会产生大量量化误差。本研究基于这一观察，设计了一种新型的自适应块缩放数据类型，能够根据输入值的分布进行自适应调整。针对4比特量化，我们提出的IF4（整型/浮点4比特）数据类型可在每组16个数值中动态选择FP4或INT4表示方式，并采用与NVFP4相同的E4M3缩放因子进行缩放。所选数据类型通过缩放因子的符号位进行标识（该符号位在NVFP4中当前未被使用），我们运用相同原理为其他比特宽度设计了相应格式，包括IF3和IF6。在语言模型量化应用中，IF4优于现有的4比特块缩放格式，在量化训练过程中实现更低的损失，并在训练后量化的多项任务中达到更高准确率。此外，我们设计并评估了IF4乘加运算单元，以证明IF4能在下一代硬件加速器中高效实现。相关代码已发布于https://github.com/mit-han-lab/fouroversix。

摘要 (Abstract)

NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor’s sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.

Keywords: quantization, large language models, 4-bit format, adaptive block-scaled data types, IF4, post-training quantization, hardware accelerators, model compression
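
A toy sketch of the per-block choice between FP4 and INT4 described in the abstract. It makes several assumptions that the abstract does not state: the selection criterion is mean squared error, INT4 uses the symmetric range -7..7, and the scale factor is kept in full precision instead of being rounded to E4M3 with the choice stored in its sign bit. The FP4 grid is the standard E2M1 set of magnitudes; the example blocks are made up.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
INT4_GRID = [float(i) for i in range(8)]              # assumed symmetric INT4: 0..7


def quantize_block(block, grid):
    """Scale so the block's max magnitude maps to the grid's top value,
    then round each magnitude to the nearest grid point and rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / grid[-1]  # a real format would round this to E4M3
    out = []
    for v in block:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize_if4(block):
    """Pick whichever 4-bit representation reconstructs the block better.

    The MSE criterion is an assumption for illustration only.
    """
    fp = quantize_block(block, FP4_GRID)
    it = quantize_block(block, INT4_GRID)
    if mse(block, it) < mse(block, fp):
        return "INT4", it
    return "FP4", fp


# Evenly spread values: the uniform INT4 grid fits them well.
uniform_block = [float(k) for k in range(-7, 8)] + [0.0]
# Mostly small values plus one large one: FP4's finer steps near zero help.
outlier_block = [0.5, -1.0, 1.5, 0.5] * 3 + [-0.5, 1.0, 0.5, 6.0]

print(quantize_if4(uniform_block)[0])
print(quantize_if4(outlier_block)[0])
```

The two blocks illustrate why a per-block choice can help: neither grid is best for every value distribution.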

118. ❌ EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models

Authors: Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie, Zhiyi Sha, Rui Zhang Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

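Scoring rationale: The paper applies large language models (LLMs) to medical diagnosis, using supervised fine-tuning (SFT) to build EpiScreen, a system for early epilepsy detection. This is an AI for Science application in biomedical informatics. It is therefore highly relevant to "Large Language Models", "Post-training/SFT", and "AI for Science/Bioinformatics" (10 each). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, CoT, Agents, and Quantization concern model architecture, training methods, inference optimization, or agent systems that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The study develops EpiScreen, which fine-tunes large language models on clinical notes from electronic health records for early epilepsy detection. It reaches AUCs of up to 0.875 on MIMIC-IV and 0.980 on a private University of Minnesota cohort, and EpiScreen-assisted neurologists outperform unaided experts by up to 10.9% in a clinician-AI collaboration setting.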

摘要翻译

癫痫与心因性非癫痫性发作常表现出相似的发作性症状，但需要根本不同的管理策略。误诊现象普遍，可能导致诊断延迟延长、不必要的治疗及严重的患者发病率。虽然长程视频脑电图是诊断的金标准，但其高昂成本与有限的可及性阻碍了及时诊断。本研究开发了一种低成本、高效的早期癫痫检测方法EpiScreen，该方法利用电子健康记录中常规采集的临床笔记。通过对标注笔记进行大语言模型微调，EpiScreen在MIMIC-IV数据集上达到最高0.875的受试者工作特征曲线下面积（AUC），在明尼苏达大学私有队列中达到0.980。在临床医生与人工智能协作的场景下，EpiScreen辅助的神经科医生诊断表现较未辅助专家最高提升10.9%。总体而言，本研究证明EpiScreen能够支持早期癫痫检测，促进及时且经济高效的筛查，有助于减少诊断延迟并避免不必要的干预措施，尤其在资源有限地区具有应用潜力。

摘要 (Abstract)

Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.

Keywords: Epilepsy detection, Large Language Models, Electronic Health Records, Supervised Fine-tuning, Clinical notes, AI for Science, Bioinformatics, Diagnostic screening

119. ❌ SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Authors: Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

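Scoring rationale: The paper proposes SOLE-R1, a video-language reasoning model designed to serve as the sole reward signal for online reinforcement learning. The model estimates task progress with spatiotemporal chain-of-thought (CoT) reasoning and is trained with a hybrid framework that combines supervised fine-tuning (SFT) and reinforcement learning. It is therefore highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (10), and related to "Post-training OR Supervised Fine-tuning OR SFT" (8) and to "Large Language Models OR LLMs OR Foundation Models" (8), the latter because the paper applies vision-language models (VLMs). The other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Existing vision-language models fail as reward evaluators in reinforcement learning under partial observability and distribution shift. The paper proposes SOLE-R1, which enables zero-shot online RL from only video observations and a natural-language goal, and substantially outperforms strong vision-language rewarders across four simulation environments and a real-robot setting.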

摘要翻译

视觉语言模型（VLMs）已在多样化任务中展现出卓越能力，这推动了利用此类模型监督机器人学习的相关研究。然而，当将当前最先进的模型作为强化学习（RL）中的评估器时，它们在部分可观测性和分布偏移条件下常出现失效，导致策略利用感知错误而非真正解决问题。为应对这一局限，我们提出SOLE-R1（自观测学习器），这是一种专为在线强化学习提供唯一奖励信号而设计的视频语言推理模型。仅需输入原始视频观测数据和自然语言目标，SOLE-R1即可执行逐时间步的时空思维链推理，并生成可直接作为奖励使用的密集任务进度估计值。为训练SOLE-R1，我们开发了大规模视频轨迹与推理合成流程，该流程能生成与连续进度监督对齐的时序锚定思维链轨迹。此类数据与基础空间推理及多帧时序推理能力相结合，并通过耦合监督微调与可验证奖励强化学习的混合框架进行模型训练。在四个不同仿真环境及真实机器人场景中，SOLE-R1实现了从随机初始化的零样本在线强化学习：机器人在没有真实奖励、成功标识、示范数据或任务特定调优的情况下，成功学习此前未见过的操作任务。SOLE-R1在24项未见任务上取得成功，显著超越了包括GPT-5和Gemini-3-Pro在内的强视觉语言奖励模型，同时对奖励操控表现出明显更强的鲁棒性。

摘要 (Abstract)

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today’s strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.

Keywords: Video-Language Reasoning, Reinforcement Learning, Chain-of-Thought Reasoning, Vision-Language Models, Robot Learning, Reward Signal, Zero-shot Learning, Spatiotemporal Reasoning

120. ❌ Training data generation for context-dependent rubric-based short answer grading

Authors: Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

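Scoring rationale: The paper studies training data generation for educational assessment, specifically automatic grading of students' short answers in the PISA test. Its focus is on data generation techniques (using simple text formats to preserve confidentiality and building surrogate datasets), not on large-model principles, architectural innovation, or scientific applications. Every keyword concerns the principles, applications, or innovations of large models and deep learning, whereas this paper involves no large-model technique; it only mentions that machine learning methods may be used for automatic grading, without technical detail. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to generate a large-scale training dataset from a small confidential reference dataset, in support of automatic grading of students' short answers in the PISA test, and finds that one of the methods may improve model training.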

摘要翻译

每四年，经济合作与发展组织（OECD）会开展PISA测试，以评估全球青少年学生的知识水平，并促进不同教育体系间的比较。然而，由于需要规避语言差异和评分者偏差，学生答案的评分工作面临挑战。因此，比较自动评分学生答案的方法具有重要意义。其中一些方法需要机器学习进行训练，另一些则需计算参数或选择超参数，这均需大量特定领域的数据。在本研究中，我们探索了少量方法，仅以相对较小的保密数据集为参考，通过利用一组极为简单的衍生文本格式来保护数据机密性，从而创建大规模训练数据集。运用这些方法，我们成功构建了三个替代数据集，这些数据集至少在表面上比单纯基于提示生成的结果更接近参考数据集。初步实验表明，其中一种方法可能还有助于提升模型训练效果。

摘要 (Abstract)

Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.

Keywords: training data generation, PISA test, automatic student answer grading, confidential dataset, surrogate datasets, machine learning, educational assessment, short answer grading

121. ❌ Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

Authors: Younes Javanmard, Tanmoy Pandit, Masoud Mardani Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

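Scoring rationale: The paper's core subject is compressing Transformer language models with Matrix Product Operator decomposition. This is its main technical contribution, so it is highly relevant to "Quantization OR Model Compression OR Low-bit Weights" (15). It is relevant to "Small Language Models OR SLMs OR On-device AI" (10) because the compression targets deployment on resource-constrained hardware, and related to "Large Language Models OR LLMs OR Foundation Models" (8) because the study builds on a GPT-2-style model. Other keywords such as MoE, Scaling Laws, and Alignment are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper studies Matrix Product Operator decomposition as a compression method for Transformer language models. On PicoGPT it achieves up to 13x compression per Transformer block at chi = 4; at chi = 16 the model shrinks from 1,020,224 to 191,872 parameters while retaining 97.7% of baseline token accuracy.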

摘要翻译

基于Transformer的语言模型在各类自然语言处理任务中均表现出强劲性能，但其参数量随隐藏维度呈二次方增长，导致在资源受限硬件上的部署成本高昂。本研究探讨将矩阵乘积算子（Matrix Product Operator, MPO）分解作为一种原理性的Transformer压缩方法。MPO将权重矩阵分解为一系列低秩核心的链式结构，其近似质量由键维（bond dimension）chi控制。我们在PicoGPT（一个约含100万参数、GPT-2风格的字符级语言模型）中，将所有nn.Linear层替换为以MPO链形式参数化的MPOLinear模块。核心权重可通过预训练稠密权重的TT-SVD分解初始化，或采用随机初始化，并使用标准PyTorch自动微分进行训练，无需定制反向传播过程。我们针对PicoGPT中五种不同的权重形状推导了平衡分解方案，并在Tiny Shakespeare数据集上评估了键维chi ∈ {4, 8, 16, 32}的性能。当chi = 4时，MPO压缩在每个Transformer模块中实现了高达13倍的压缩率。在chi = 16时，模型参数量从1,020,224降至191,872，同时保持基线词元准确率的97.7%（51.6%对比52.8%）。重构误差符合预期趋势，在相同键维下，三站点分解的误差低于两站点分解。chi = 8的模型在单位参数量准确率上表现最优，在此指标上超过稠密基线2.7倍。这些结果表明，MPO参数化是一种实用且理论依据充分的Transformer压缩方法，可作为低秩分解与非结构化剪枝的有效替代方案。

摘要 (Abstract)

Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.

Keywords: Transformer compression, Matrix Product Operator, Model compression, Parameter efficiency, Low-rank decomposition, PicoGPT, Resource-constrained deployment, Weight factorization
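
A short sketch of how the parameter count of an MPO chain depends on the bond dimension chi, followed by a check of the arithmetic reported in the abstract. The 128 x 128 layer shape and its factorizations are hypothetical and are not taken from PicoGPT; the sketch also assumes every internal bond has the same dimension chi.

```python
def mpo_param_count(in_factors, out_factors, chi):
    """Parameters in an MPO chain whose k-th core has shape
    (chi_left, in_k, out_k, chi_right), with boundary bonds of size 1."""
    assert len(in_factors) == len(out_factors)
    n = len(in_factors)
    total = 0
    for k, (i, o) in enumerate(zip(in_factors, out_factors)):
        left = 1 if k == 0 else chi
        right = 1 if k == n - 1 else chi
        total += left * i * o * right
    return total


# Hypothetical 128 x 128 dense layer (not a PicoGPT shape).
dense = 128 * 128
two_site = mpo_param_count([8, 16], [16, 8], chi=4)          # 128 = 8*16
three_site = mpo_param_count([4, 4, 8], [8, 4, 4], chi=4)    # 128 = 4*4*8
print(dense, two_site, three_site)

# Figures quoted in the abstract for chi = 16.
reported_ratio = 1_020_224 / 191_872  # whole-model compression ratio
retained = 51.6 / 52.8                # fraction of baseline token accuracy
print(round(reported_ratio, 2), round(retained, 3))
```

The whole-model ratio at chi = 16 is about 5.3x, which is distinct from the per-block figure of up to 13x that the abstract reports at chi = 4.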

122. ❌ GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum

Authors: Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出GraphWalker框架，专注于知识图谱问答（KGQA）中的智能体（agent）训练与推理。核心贡献在于两阶段SFT训练范式（第一阶段：基于合成轨迹的广泛探索训练；第二阶段：基于专家轨迹的反思与纠错微调），以解决训练数据稀缺和推理泛化问题。因此，与"Post-training OR Supervised Fine-tuning OR SFT"高度相关（10分），因为SFT是核心训练方法；与"Chain of Thought OR CoT Reasoning OR Multi-step Reasoning"和"System 2 Thinking OR Slow Thinking OR In-depth Reasoning"高度相关（10分），因为论文强调多步推理和深度推理；与"Self-Correction OR Self-Improvement OR Self-Reflection"高度相关（10分），因为第二阶段训练旨在提升反思和错误恢复能力；与"LLM Agents OR Autonomous Agents OR Agentic Workflow"高度相关（10分），因为框架本质是智能体在知识图谱上的自主导航。与"Large Language Models OR LLMs OR Foundation Models"有一定关联（5分），因为智能体可能基于LLM构建，但论文未明确说明模型类型。其他关键词如MoE、SLMs、RAG、量化等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

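To address training data scarcity and reasoning generalization in knowledge graph question answering, the paper proposes the GraphWalker framework, which uses automated trajectory synthesis and two-stage SFT training so that the agent can explore and reflect efficiently on KGs, and it achieves state-of-the-art performance on CWQ and WebQSP.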

摘要翻译

智能知识图谱问答（Agentic KGQA）要求智能体与知识图谱（KGs）进行迭代式交互，这同时带来了训练数据稀缺和推理泛化方面的挑战。具体而言，现有方法通常限制了智能体的探索能力：基于提示的方法缺乏自主导航训练，而当前的训练流程通常将推理限制在预定义的轨迹上。为此，本文提出 \textit{GraphWalker}，一种新颖的智能知识图谱问答框架，通过\textit{自动化轨迹合成}与\textit{分阶段微调}来解决这些挑战。GraphWalker 采用两阶段监督微调（SFT）训练范式：首先，智能体在由受限随机游走路径合成的结构多样化轨迹上进行训练，从而在知识图谱上建立广泛的探索先验；其次，智能体进一步在一小部分专家轨迹上进行微调，以发展反思与错误恢复能力。大量实验表明，我们的分阶段 SFT 范式为轻量级强化学习（RL）阶段解锁了更高的性能上限，使 GraphWalker 在 CWQ 和 WebQSP 数据集上取得了最先进的性能。在 GrailQA 和我们构建的 GraphWalkerBench 上的额外结果证实，GraphWalker 增强了对分布外推理路径的泛化能力。代码公开于 https://github.com/XuShuwenn/GraphWalker。

摘要 (Abstract)

Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker

Keywords: Agentic Knowledge Graph Question Answering, Automated Trajectory Synthesis, Stage-wise Fine-tuning, Supervised Fine-tuning (SFT), Multi-step Reasoning, Self-Reflection, Generalization, State-of-the-art Performance

123. ❌ EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Authors: Léane Jourdan, Julien Aubert-Béduchaud, Yannis Chupin, Marah Baccari, Florian Boudin Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28515v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

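Scoring rationale: The paper's core contribution is EarlySciRev, a dataset for studying revision behaviour in scientific writing, and it explicitly notes that the dataset can support the evaluation of large language models (LLMs) for scientific writing. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since LLMs are a direct object of application and evaluation in the paper. It is related to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), since the paper focuses on scientific writing (regarded here as a subfield of AI for Science) but does not go deep into a specific scientific domain such as bioinformatics. The remaining keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large models; the paper does not address these techniques, so they all score 0.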

!!! tip deepseek-chat TL;DR

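To address the lack of early-stage data for studying revision behaviour in scientific writing, the paper extracts commented-out text from arXiv LaTeX source files and builds EarlySciRev, a dataset of 578k validated paragraph-level revision pairs, to support research on scientific writing dynamics, revision modelling, and LLM-assisted editing.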

摘要翻译

科学写作是一个产生丰富修订痕迹的迭代过程，然而公开可用的资源通常仅呈现论文的最终或接近最终版本。这限制了对修订行为的实证研究以及对大型语言模型（LLM）在科学写作中应用效果的评估。我们推出了EarlySciRev数据集，这是一个通过从arXiv LaTeX源文件中自动提取的早期科学文本修订数据集。我们的核心观察是：LaTeX中被注释掉的文本往往保留了作者本人所写的废弃或替代表述。通过将被注释的片段与邻近的最终文本进行对齐，我们提取出段落级别的候选修订对，并应用基于LLM的过滤方法以保留真实的修订内容。从128万对候选对出发，我们的处理流程最终生成了57.8万对基于真实早期草稿痕迹的已验证修订对。此外，我们还提供了一个用于修订检测的人工标注基准。EarlySciRev补充了现有专注于后期修订或合成重写的资源，并支持对科学写作动态、修订建模以及LLM辅助编辑的研究。

摘要 (Abstract)

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.

Keywords: scientific writing, revision traces, dataset, LaTeX, large language models, LLMs, arXiv, text revisions
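
The first step described in the EarlySciRev abstract, collecting commented-out text from LaTeX source, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `commented_blocks`, the restriction to lines that are commented out in full, and the sample text are all assumptions made for illustration, and the alignment and LLM-based filtering stages are not shown.

```python
import re

# Matches a line that is commented out in full: optional indent, one or more '%',
# then the comment text. A line starting with an escaped '\%' does not match.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def commented_blocks(latex_source):
    """Group consecutive fully commented-out lines into text blocks (hypothetical helper)."""
    blocks, current = [], []
    for line in latex_source.splitlines():
        match = COMMENT_LINE.match(line)
        if match and match.group(1).strip():
            current.append(match.group(1).strip())
        else:
            # A non-comment line (or an empty comment) closes the current block.
            if current:
                blocks.append(" ".join(current))
                current = []
    if current:
        blocks.append(" ".join(current))
    return blocks


sample = "\n".join([
    r"We propose a new method.",
    r"% We propose a novel and very new method",
    r"% that is better than all others.",
    r"It improves accuracy by 5\% on average.",
    r"% TODO check numbers",
])
blocks = commented_blocks(sample)
print(blocks)
```

The second block in the sample is a to-do note rather than a discarded formulation, which shows why raw candidates need a filtering step before they can count as revisions.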

124. ❌ TIEG-Youpu Solution for NeurIPS 2022 WikiKG90Mv2-LSC

Authors: Feng Nie, Zhixiu Ye, Sifa Xie, Shuang Wu, Xin Yuan, Liang Yao, Jiazhen Peng, Xu Cheng Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

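Scoring rationale: The paper studies knowledge graph embedding (KGE), focusing on link prediction on the large-scale knowledge graph WikiKG90Mv2. It adopts a retrieve-then-re-rank pipeline and proposes a priority infilling retrieval model and an ensemble-based re-ranking model. All scoring keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, reasoning, alignment, and compression); the paper involves no LLM techniques and does not mention novel applications of deep learning in scientific domains, so every keyword has relevance 0.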

!!! tip deepseek-chat TL;DR

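For link prediction on the large-scale knowledge graph WikiKG90Mv2, the paper proposes a method that combines priority infilling retrieval with neighbor-enhanced ensemble re-ranking, raising validation-set MRR from 0.2342 to 0.2839.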

摘要翻译

NeurIPS 2022中的WikiKG90Mv2是一个大规模百科全书式知识图谱。将知识图谱嵌入连续向量空间对于许多实际应用至关重要，例如知识获取、问答系统和推荐系统。与现有知识图谱相比，WikiKG90Mv2是一个由超过9000万个实体组成的大规模知识图谱。在为大规模知识图谱构建图嵌入模型时，需要同时兼顾效率与准确性。为此，我们遵循“检索后重排序”的流程框架，并在检索和重排序阶段均提出了创新性改进。具体而言，我们提出了一种优先级填充检索模型，以获取在结构和语义上均相似的候选实体；随后提出一种基于集成学习的重排序模型，该模型采用邻居增强表示方法，在检索得到的候选实体中生成最终的链接预测结果。实验结果表明，我们提出的方法优于现有基线方法，并将验证集的平均倒数排名（MRR）从0.2342提升至0.2839。

摘要 (Abstract)

WikiKG90Mv2 in NeurIPS 2022 is a large encyclopedic knowledge graph. Embedding knowledge graphs into continuous vector spaces is important for many practical applications, such as knowledge acquisition, question answering, and recommendation systems. Compared to existing knowledge graphs, WikiKG90Mv2 is a large scale knowledge graph, which is composed of more than 90 millions of entities. Both efficiency and accuracy should be considered when building graph embedding models for knowledge graph at scale. To this end, we follow the retrieve then re-rank pipeline, and make novel modifications in both retrieval and re-ranking stage. Specifically, we propose a priority infilling retrieval model to obtain candidates that are structurally and semantically similar. Then we propose an ensemble based re-ranking model with neighbor enhanced representations to produce final link prediction results among retrieved candidates. Experimental results show that our proposed method outperforms existing baseline methods and improves MRR of validation set from 0.2342 to 0.2839.

Keywords: knowledge graph embedding, link prediction, retrieval-re-ranking, priority infilling, ensemble model, neighbor enhanced representation, WikiKG90Mv2, large-scale knowledge graph
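
The abstract above reports its result as mean reciprocal rank (MRR). As a reminder of what that number measures, here is a small sketch of the standard definition. The function name and the toy entity IDs are hypothetical, and the official benchmark evaluator may differ in details such as how many candidates are scored per query.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct answer; a query whose answer is missing contributes 0."""
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)
    return total / len(gold)


# Three toy link-prediction queries, each with a ranked list of candidate tail entities.
queries = [["Q5", "Q42", "Q7"], ["Q1", "Q2", "Q3"], ["Q9", "Q8", "Q6"]]
answers = ["Q42", "Q1", "Q0"]
mrr = mean_reciprocal_rank(queries, answers)
print(mrr)  # (1/2 + 1/1 + 0) / 3
```

Under this definition, the retrieval stage bounds the achievable MRR: an answer that is not among the retrieved candidates cannot be recovered by re-ranking.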

125. ❌ IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Authors: Zhongping Ji Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28430v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

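Scoring rationale: The paper directly targets KV cache compression for LLMs, so it is highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (15 points) and counts as an innovation in core large-model techniques. It involves low-bit quantization and is relevant to "Quantization OR Model Compression OR Low-bit Weights" (10 points). The title explicitly mentions LLMs, so it is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It achieves speedups through hardware-aligned optimization and is somewhat related to "Speculative Decoding OR Inference Acceleration" (5 points). Other keywords, such as MoE, SLMs, training methods, alignment, reasoning, agents, and scientific applications, are not covered by the paper and score 0.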

!!! tip deepseek-chat TL;DR

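The paper proposes IsoQuant, a blockwise rotation framework based on quaternions and isoclinic rotations in SO(4) that reduces the compute and storage cost of orthogonal transforms for LLM KV cache compression. Across 18 fused CUDA settings it achieves mean kernel-level speedups of about 4.5×–4.7× over RotorQuant (peak above 6×) at comparable reconstruction MSE. Validation so far is limited to synthetic normalized vectors; end-to-end KV-cache evaluation is left as future work.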

摘要翻译

正交特征解耦在低比特在线向量量化中具有显著效果，但稠密随机正交变换会带来难以承受的 $O(d^2)$ 存储与计算开销。RotorQuant 通过分块 $3$D Clifford 旋量降低了这一成本，但其 $3$D 划分方案与现代硬件适配不佳，且局部混合能力有限。我们提出 \textbf{IsoQuant}，一种基于四元数代数与 $SO(4)$ 等斜分解的分块旋转框架。它将每个 $4$D 块表示为四元数，并应用闭式变换 $T(v)=q_L v \overline{q_R}$。由此产生两个主要变体：\emph{IsoQuant-Full} 实现完整的 $SO(4)$ 旋转，以及 \emph{IsoQuant-Fast} 仅保留一个等斜因子以降低计算成本；该框架还支持轻量级的 $2$D 特例。在 $d=128$ 时，IsoQuant-Full 将前向旋转成本从 RotorQuant 的约 $2{,}408$ 次融合乘加运算（FMA）降至 $1{,}024$ 次，而 IsoQuant-Fast 进一步降至 $512$ 次。在 $d \in {128,256,512}$、比特宽度 ${2,3,4}$ 以及 FP16/FP32 执行的 $18$ 种融合 CUDA 设置下，IsoQuant 在保持相近重建均方误差（MSE）的同时，相比 RotorQuant 实现了约 $4.5\times$–$4.7\times$ 的平均内核级加速，峰值加速超过 $6\times$。当前验证仅限于合成归一化向量的第一阶段量化-反量化路径；端到端的 KV 缓存（KV-cache）评估仍是未来工作。

摘要 (Abstract)

Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise 3D Clifford rotors, yet the resulting 3D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each 4D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: IsoQuant-Full, which realizes the full $SO(4)$ rotation, and IsoQuant-Fast, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight 2D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about 2,408 FMAs in RotorQuant to 1,024, while IsoQuant-Fast further reduces it to 512. Across 18 fused CUDA settings with $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$–$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize–dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.

Keywords: KV cache compression, LLM, quantization, hardware-aligned, isoclinic rotations, quaternion algebra, inference acceleration, SO(4)
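
The closed-form transform $T(v)=q_L v \overline{q_R}$ from the IsoQuant abstract can be sketched in plain Python to show why it is a rotation: multiplying by unit quaternions preserves the norm of each 4D block and is exactly invertible. This is an illustrative sketch, not the paper's CUDA kernel. Sharing one quaternion pair across all blocks, the helper names, and counting 16 multiply-adds per Hamilton product are assumptions of this example.

```python
import math
import random


def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qconj(q):
    """Quaternion conjugate."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def unit(q):
    """Normalize a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def rotate_vector(vec, q_left, q_right):
    """Apply T(v) = q_L * v * conj(q_R) to every consecutive 4D block of vec."""
    assert len(vec) % 4 == 0
    out = []
    for i in range(0, len(vec), 4):
        block = tuple(vec[i:i + 4])
        out.extend(qmul(qmul(q_left, block), qconj(q_right)))
    return out


rng = random.Random(0)
q_l = unit(tuple(rng.gauss(0, 1) for _ in range(4)))
q_r = unit(tuple(rng.gauss(0, 1) for _ in range(4)))
x = [rng.gauss(0, 1) for _ in range(128)]

y_full = rotate_vector(x, q_l, q_r)                   # both isoclinic factors
y_fast = rotate_vector(x, q_l, (1.0, 0.0, 0.0, 0.0))  # left factor only

norm_x = math.sqrt(sum(c * c for c in x))
norm_y = math.sqrt(sum(c * c for c in y_full))

# Cost accounting under the assumption of 16 multiply-adds per Hamilton product.
blocks = 128 // 4
full_cost = blocks * 2 * 16
fast_cost = blocks * 1 * 16
print(norm_x, norm_y, full_cost, fast_cost)
```

Under that accounting, 32 blocks with two products each give 1,024 multiply-adds and one product each gives 512, which matches the figures quoted in the abstract for $d=128$. Whether the paper counts operations this way is not stated in the abstract.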

126. ❌ Structural-Ambiguity-Aware Translation from Natural Language to Signal Temporal Logic

Authors: Kosei Fushimi, Kazunobu Serizawa, Junya Ikemoto, Kazumune Hashimoto Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

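Scoring rationale: The paper studies translation from natural language to Signal Temporal Logic, using Combinatory Categorial Grammar and n-best parsing to handle structural ambiguity; it belongs to formal methods, natural language processing, and control systems. It does not involve the technical principles, applications, or innovations associated with keywords such as large models, deep learning, or AI for Science, so all keywords are unrelated to its content.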

!!! tip deepseek-chat TL;DR

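To address structural ambiguity in translating natural language to Signal Temporal Logic, the paper proposes an ambiguity-preserving method based on Combinatory Categorial Grammar. Its three-stage pipeline generates and scores multiple candidate formulas, explicitly representing the possible formal interpretations of an ambiguous instruction.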

摘要翻译

信号时序逻辑（Signal Temporal Logic, STL）被广泛用于为信息物理系统指定具有时间约束和安全关键性的任务，但对于非专业用户而言，直接编写STL公式十分困难。自然语言（Natural Language, NL）提供了便捷的交互界面，但其固有的结构歧义性使得将其一对一地翻译为STL并不可靠。本文提出一种保留歧义的方法，用于将自然语言任务描述翻译为STL候选公式。其核心思想是在解析阶段保留多种合理的句法分析结果，而非强制选择单一解释。为此，我们基于组合范畴语法（Combinatory Categorial Grammar, CCG）开发了一个三阶段流程：保留歧义的n最佳解析、面向STL的基于模板的语义组合，以及带分数聚合的规范化处理。所提出的方法输出一个经去重的STL候选公式集合，并附有可能性评分，从而明确地表示一个歧义性指令的多种可能的形式化解释。与现有的一最佳自然语言到逻辑翻译方法相比，所提出的方法旨在保留附着歧义和辖域歧义。针对代表性任务描述的案例研究表明，该方法能为真正存在歧义的输入生成多个STL候选公式，同时将无歧义或规范等价的推导结果归约为单一的STL公式。

摘要 (Abstract)

Signal Temporal Logic (STL) is widely used to specify timed and safety-critical tasks for cyber-physical systems, but writing STL formulas directly is difficult for non-expert users. Natural language (NL) provides a convenient interface, yet its inherent structural ambiguity makes one-to-one translation into STL unreliable. In this paper, we propose an ambiguity-preserving method for translating NL task descriptions into STL candidate formulas. The key idea is to retain multiple plausible syntactic analyses instead of forcing a single interpretation at the parsing stage. To this end, we develop a three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving $n$-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. The proposed method outputs a deduplicated set of STL candidates with plausibility scores, thereby explicitly representing multiple possible formal interpretations of an ambiguous instruction. In contrast to existing one-best NL-to-logic translation methods, the proposed approach is designed to preserve attachment and scope ambiguity. Case studies on representative task descriptions demonstrate that the method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to a single STL formula.

Keywords: Signal Temporal Logic, Natural Language Translation, Structural Ambiguity, Combinatory Categorial Grammar, Ambiguity-preserving Parsing, Formal Verification, Cyber-physical Systems, Task Specification
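
The last pipeline stage named in the abstract above, canonicalization with score aggregation, can be pictured with a toy sketch. Everything concrete here is an assumption for illustration: the formula strings, the canonical form (sorting the operands of a top-level `&`), and summing then normalizing scores. The abstract does not specify how the paper canonicalizes formulas or combines scores.

```python
from collections import defaultdict


def split_top_level(formula, op="&"):
    """Split on `op` only where it is not nested inside brackets or parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(formula):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == op and depth == 0:
            parts.append(formula[start:i].strip())
            start = i + 1
    parts.append(formula[start:].strip())
    return parts


def canonicalize(formula):
    """Toy canonical form: sort the operands of a top-level conjunction."""
    return " & ".join(sorted(split_top_level(formula)))


def aggregate(candidates):
    """Merge candidates with the same canonical form, then normalize the scores."""
    totals = defaultdict(float)
    for formula, score in candidates:
        totals[canonicalize(formula)] += score
    norm = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(formula, score / norm) for formula, score in ranked]


# Hypothetical n-best output: two derivations differ only in operand order,
# the third places the "eventually" clause inside the scope of "always".
n_best = [
    ("F[0,5](goal) & G[0,10](safe)", 0.5),
    ("G[0,10](safe) & F[0,5](goal)", 0.25),
    ("G[0,10](safe & F[0,5](goal))", 0.25),
]
result = aggregate(n_best)
print(result)
```

The two equivalent derivations collapse into one candidate with a combined score, while the reading with a different scope survives as a separate candidate, mirroring the behaviour the abstract describes.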

127. ❌ LombardoGraphia: Automatic Classification of Lombard Orthography Variants

Authors: Edoardo Signoroni, Pavel Rychlý Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

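Scoring rationale: The paper focuses on automatic orthography classification for a low-resource language (Lombard). It trains traditional and neural classification models but does not involve large models, innovations in deep learning techniques, or applications in scientific domains. All keywords concern large-model techniques, deep learning innovation, or AI-for-science applications, whereas this work addresses orthography classification for a specific language, a traditional NLP task with no direct connection to the given keywords.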

!!! tip deepseek-chat TL;DR

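For Lombard, which lacks a unified orthographic standard, the paper builds LombardoGraphia, the first dataset for automatic orthography classification, and trains 24 traditional and neural classification models; the best models reach 96.06% overall accuracy and 85.78% average class accuracy.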

摘要翻译

伦巴第语是意大利北部和瑞士南部约380万人使用的一种资源匮乏的语言变体，缺乏统一的书写规范。多种正字法体系并存，为自然语言处理资源开发和模型训练带来了挑战。本文首次对伦巴第语自动正字法分类进行研究，并介绍了LombardoGraphia——一个包含11,186个伦巴第语维基百科样本的精选语料库，这些样本标注了9种正字法变体，以及用于自动正字法分类的模型。我们通过处理和过滤原始维基百科内容构建数据集，确保文本适用于正字法分析。我们训练了24个传统和神经分类模型，采用多种特征和编码层级。最佳模型实现了96.06%的整体准确率和85.78%的平均类别准确率，但由于数据不平衡，少数类别的分类性能仍具挑战。本研究为构建伦巴第语变体感知的自然语言处理资源提供了关键基础设施。

摘要 (Abstract)

Lombard, an underresourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification and LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset, processing and filtering raw Wikipedia content to ensure text suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% and 85.78% overall and average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.

Keywords: Lombard language, orthography classification, underresourced language, NLP resource development, Wikipedia corpus, neural classification models, data imbalance, automatic classification
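
The gap between the two accuracy figures in the abstract above (96.06% overall, 85.78% average class) is what one would expect under class imbalance. The sketch below shows the effect on toy data, assuming that "average class accuracy" means the unweighted mean of per-class accuracies; the label names and the function are hypothetical.

```python
from collections import defaultdict


def overall_and_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, unweighted mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    overall = sum(correct.values()) / len(y_true)
    per_class = {label: correct[label] / total[label] for label in total}
    average = sum(per_class.values()) / len(per_class)
    return overall, average


# Eight samples of a majority variant "A" and two of a minority variant "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
overall, average = overall_and_class_accuracy(y_true, y_pred)
print(overall, average)  # 0.9 overall, 0.75 averaged over classes
```

A single error on the minority class barely moves the overall figure but halves that class's accuracy, so the class-averaged figure drops much further.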

128. ❌ Not All Subjectivity Is the Same! Defining Desiderata for the Evaluation of Subjectivity in NLP

Authors: Urja Khurana, Michiel van der Meer, Enrico Liscio, Antske Fokkens, Pradeep K. Murukannaiah Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

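Scoring rationale: This is a position paper on evaluating subjectivity in NLP. It focuses on evaluation frameworks, methodology, and a user-centric perspective, and does not address specific technical topics such as large-model principles, architectural innovation, training methods, inference optimization, agent systems, or AI-for-science applications. It discusses general questions of NLP model evaluation rather than LLM-specific techniques or applications, so it has no direct connection to any of the technical keywords.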

!!! tip deepseek-chat TL;DR

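This position paper proposes seven desiderata for evaluating subjectivity-sensitive NLP models. By analyzing 60 papers, it finds that current research falls short in distinguishing ambiguous from polyphonic input, in effectively expressing subjectivity to users, and in the interplay between different desiderata.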

摘要翻译

主观判断是众多自然语言处理（NLP）数据集的组成部分，近期研究日益重视使模型输出能够反映这种观点多样性的建模方法。此类响应有助于揭示少数群体的声音——这些声音常因主流视角而被边缘化或遮蔽。然而，当前的评估实践是否与这些模型的目标相一致，仍是一个待解决的问题。本立场论文基于主观性在NLP数据与模型中的表征方式，提出了针对主观性敏感模型的七项评估要求。这些要求采用自上而下的构建思路，重点关注此类模型对用户产生的实际影响。通过对60篇论文实验设计的梳理，我们发现主观性的多个维度仍未得到充分研究：例如模糊输入与多声部（polyphonic）输入的区别、主观性能否有效传达给用户、各项评估要求之间缺乏联动性等不足。

摘要 (Abstract)

Subjective judgments are part of several NLP datasets and recent work is increasingly prioritizing models whose outputs reflect this diversity of perspectives. Such responses allow us to shed light on minority voices, which are frequently marginalized or obscured by dominant perspectives. It remains a question whether our evaluation practices align with these models’ objectives. This position paper proposes seven evaluation desiderata for subjectivity-sensitive models, rooted in how subjectivity is represented in NLP data and models. The desiderata are constructed in a top-down approach, keeping in mind the user-centric impact of such models. We scan the experimental setup of 60 papers and show that various aspects of subjectivity are still understudied: the distinction between ambiguous and polyphonic input, whether subjectivity is effectively expressed to the user, and a lack of interplay between different desiderata, amongst other gaps.

Keywords: subjectivity, NLP evaluation, evaluation desiderata, user-centric impact, ambiguous input, polyphonic input, minority voices, position paper

129. ❌ Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners

Authors: Soufiane Jhilal, Eleonora Pasqua, Caterina Marchesi, Riccardo Corradi, Martina Galletti Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

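Scoring rationale: The paper studies AI-driven reading scaffolds for neurodiverse learners. It belongs to educational technology and does not involve large models, the technical principles of deep learning, or AI-for-science applications. All keywords concern specific techniques (large-model methods, training, inference optimization, agent systems, model compression) or AI-for-science applications, whereas the paper only mentions being AI-driven without specifying the AI techniques used, and its focus is educational intervention rather than AI innovation; every keyword therefore has relevance 0.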

!!! tip deepseek-chat TL;DR

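The study examines how different AI-driven reading scaffolds (sentence-segmented text, pictograms, and keyword labels) affect reading comprehension in neurodiverse learners. In a pilot with 14 learners it finds that no single scaffold is universally optimal, which points to the need for adjustable scaffold designs.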

摘要翻译

神经多样性学习者通常需要阅读支持，但增加支架的丰富度有时可能导致注意力与工作记忆超负荷，而非提升理解能力。本研究基于建构-整合模型及适应性支架视角，探讨在监督性包容环境中，结构性支架与语义性支架如何影响理解能力与阅读体验。通过使用改良的阅读界面，我们比较了四种呈现模式：未修改文本、句子分割文本、带象形图的分割文本，以及带象形图与关键词标签的分割文本。在一项包含14名有特殊教育需求与残疾的小学学习者的组内初步实验中，我们采用标准化问题测量阅读理解能力，同时收集了简短的学习者自评与治疗师报告体验数据及开放式反馈。结果呈现出异质性反应：部分学习者表现出符合分割与象形图益处的模式，而另一些学习者在引入视觉支架时则表现出符合协调成本增加的模式。不同模式间的体验评分差异有限，其中部分明显效应与临床复杂性相关，特别是在感知理解难易度方面。学习者的开放式反馈频繁要求更简明的措辞及额外的视觉支持。这些发现表明，不存在普遍最优的单一支架，从而强化了精细化校准与可调节支架的必要性，并为监督性包容阅读环境中人机协同调节的设计提供了启示。

摘要 (Abstract)

Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.

Keywords: AI-driven reading scaffolds, neurodiverse learners, reading comprehension, structural scaffolds, semantic scaffolds, inclusive education, human-AI co-regulation, special educational needs

130. ❌ Coconstructions in spoken data: UD annotation guidelines and first results

Authors: Ludovica Pannitto, Sylvain Kahane, Kaja Dobrovoljc, Elena Battaglia, Bruno Guillaume, Caterina Mauri, Eleonora Zucchini Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28261v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

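Scoring rationale: This paper focuses on developing syntactic annotation guidelines in linguistics, specifically for annotating dependency relations that span speaker turns in spoken-language corpora. It covers linguistic theory, corpus annotation methods, and syntactic analysis, and has no direct connection to any of the listed keywords on large models, deep learning, AI techniques, or AI-for-science applications. The paper does not involve any machine learning model, algorithm, training method, inference technique, AI application, or related technical concept, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes annotation guidelines, within the Universal Dependencies framework, for syntactic dependencies that span speaker turns in spoken-language treebanks (including collaborative coconstructions, question answers, and backchannels), together with two representations and new proposals for distinguishing reformulations from repairs.

Abstract translation

This paper proposes annotation guidelines for cross-turn syntactic dependencies in spoken-language treebanks within the Universal Dependencies framework, covering collaborative coconstructions, answers to wh-questions, and backchannels. Two representations are proposed: a speaker-based representation that follows the segmentation into speech turns, and a dependency-based representation that allows dependencies across turns. The paper also puts forward new proposals for distinguishing reformulations from repairs and for promoting elements in unfinished phrases.

Abstract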

The paper proposes annotation guidelines for syntactic dependencies that span across speaker turns - including collaborative coconstructions proper, wh-question answers, and backchannels - in spoken language treebanks within the Universal Dependencies framework. Two representations are proposed: a speaker-based representation following the segmentation into speech turns, and a dependency-based representation with dependencies across speech turns. New propositions are also put forward to distinguish between reformulations and repairs, and to promote elements in unfinished phrases.

Keywords: syntactic dependencies, spoken language treebanks, Universal Dependencies, collaborative coconstructions, speaker turns, annotation guidelines, dependency-based representation

131. ❌ Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

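Scoring rationale: The paper proposes the Kernel-Smith framework, which uses an LLM-driven evolutionary agent to optimize GPU kernels. Its core involves an LLM as the foundation model, an evolutionary agent workflow, reinforcement learning training (related to RLHF/RLAIF/DPO), a self-improvement mechanism, and an agent system. It is rated highly relevant (10) to LLM, Post-training/SFT, RLHF/RLAIF/DPO, Self-Correction/Self-Improvement/Self-Reflection, and LLM Agents/Autonomous Agents/Agentic Workflow, because these are core components of the framework. Other keywords such as MoE, SLMs, Scaling Laws, PEFT, RAG, and Context Window are unrelated to the paper's GPU kernel optimization topic and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Kernel-Smith framework, which uses an LLM-driven evolutionary agent to optimize GPU kernel generation, achieves state-of-the-art performance on KernelBench, and has been applied to production systems.

Abstract translation

We present Kernel-Smith, a framework for generating high-performance GPU kernels and operators that combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidate programs and iteratively improves them using an archive of high-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make the search reliable, we build dedicated evaluation services for the Triton backend on NVIDIA GPUs and the Maca backend on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into supervised and reinforcement learning signals centered on single-step revisions, retaining the code revisions that preserve correctness and bring high gains, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves the best overall performance on KernelBench with the Nvidia Triton backend, attains the best average speedup ratio, and outperforms frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where Kernel-Smith-MACA-30B surpasses large-scale models such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow has produced upstream contributions to production systems such as SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.

Abstract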

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
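
The agent-side loop described in the abstract (a population of executable candidates, an archive of top-performing and diverse programs, and structured feedback on compilation, correctness, and speedup) can be illustrated with a toy sketch. Everything in it is hypothetical and not taken from the paper: `Candidate`, `evaluate`, `propose_revision`, and the integer "tile size" merely stand in for real kernels, the evaluation service, and the LLM's single-step revision.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tile: int        # toy stand-in for a kernel's source code
    compiled: bool   # structured feedback: did it build?
    correct: bool    # structured feedback: did it match the reference?
    speedup: float   # structured feedback: speedup over a baseline


def evaluate(tile: int) -> Candidate:
    """Hypothetical evaluation service returning structured feedback."""
    compiled = tile > 0
    correct = compiled and tile % 2 == 0  # pretend odd tiles give wrong results
    # Toy speedup curve that peaks at tile == 64 (an arbitrary assumption).
    speedup = 1.0 + 1.0 / (1.0 + abs(tile - 64)) if correct else 0.0
    return Candidate(tile, compiled, correct, speedup)


def propose_revision(parent: Candidate, rng: random.Random) -> int:
    """Stand-in for the LLM proposing a single-step revision of a parent."""
    return parent.tile + rng.choice([-16, -8, -2, 2, 8, 16])


def evolve(seed_tiles, steps=200, archive_size=4, seed=0):
    rng = random.Random(seed)
    archive = [evaluate(t) for t in seed_tiles]
    for _ in range(steps):
        parents = [c for c in archive if c.correct] or archive
        child = evaluate(propose_revision(rng.choice(parents), rng))
        if not child.correct:
            continue  # discard revisions that break compilation or correctness
        # Keep distinct programs only (a crude diversity rule), then the best few.
        by_tile = {c.tile: c for c in archive + [child]}
        archive = sorted(by_tile.values(), key=lambda c: c.speedup, reverse=True)
        archive = archive[:archive_size]
    return sorted(archive, key=lambda c: c.speedup, reverse=True)


best = evolve([8, 16])[0]
print(best)
```

Because the archive always keeps the fastest correct candidate seen so far, the returned best entry can never be slower than the best seed; a real system would replace the toy diversity rule with something richer.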

Keywords: GPU kernel optimization, evolutionary agent, LLM-driven optimization, reinforcement learning, post-training, autonomous agents, performance speedup, heterogeneous platforms

132. ❌ The Necessity of Setting Temperature in LLM-as-a-Judge

Authors: Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

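Scoring rationale: The paper's core topic is the temperature setting in the LLM-as-a-Judge evaluation paradigm. It is highly relevant (10) only to the keyword 'Large Language Models OR LLMs OR Foundation Models', because the paper is entirely about applying LLMs to evaluation tasks. The other keywords concern model architecture, training methods, inference optimization, application domains, and so on, none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the temperature setting affects the performance of LLMs as evaluators; through experiments and causal analysis it finds that temperature has a significant effect on evaluation results, and it offers engineering recommendations for improving evaluation pipelines.

Abstract translation

LLM-as-a-Judge has become an efficient, low-cost paradigm for evaluating text quality and factual accuracy. Prior work shows high agreement between LLM judges and human experts, even on tasks that are hard to evaluate automatically. In practice, researchers usually adopt a fixed temperature configuration during evaluation, with 0.1 and 1.0 being the most common choices, a convention based mainly on experience rather than theoretical principle. However, recent studies show that LLM performance is notably sensitive to temperature, that lower temperatures do not always give the best results, and that the effect is highly task-dependent. This raises a key research question: does temperature affect judge performance in LLM-centric evaluation? To answer it, we systematically explore the relationship between temperature and judge performance through a series of controlled experiments, and we further adopt a causal inference framework in our empirical statistical analysis to rigorously test the direct causal effect of temperature on judge behavior, providing actionable engineering insights for designing LLM-centric evaluation pipelines.

Abstract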

LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process (with values of 0.1 and 1.0 being the most prevalent choices), a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
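
As background for why temperature can matter to a judge, here is a minimal sketch of the usual temperature-scaled softmax: logits are divided by the temperature before normalization, so low values sharpen the distribution over verdicts and high values flatten it. The verdict logits below are invented for illustration and the function name is hypothetical; nothing here reproduces the paper's experiments.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return probabilities for logits rescaled by 1 / temperature."""
    if temperature <= 0:
        # Many serving stacks treat temperature 0 as greedy decoding;
        # this sketch simply rejects it to keep the math well defined.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits a judge assigns to the verdict tokens "A", "B", "tie".
verdict_logits = [2.0, 1.5, 0.2]
for t in (0.1, 1.0):
    probs = softmax_with_temperature(verdict_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the two temperatures the abstract singles out, the same logits give a near-deterministic verdict at 0.1 and a noticeably more spread-out one at 1.0, which is one reason repeated judgments can disagree at higher temperatures.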

Keywords: LLM-as-a-Judge, temperature, evaluation, causal inference, performance, empirical analysis, engineering insights

133. ❌ Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language

Authors: Verena Platzgummer, John McCrae, Sina Ahmadi Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

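Scoring rationale: The paper's core topic is how well large language models (LLMs) handle non-standard language (such as South Tyrolean dialects and varieties of Kurdish) and the sociolinguistic implications, so it is highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models'. It does not cover the technical details behind the other keywords (such as MoE, quantization, or inference acceleration) or specific application domains (such as bioinformatics), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper examines how large language models handle non-standard language (using South Tyrolean dialects and varieties of Kurdish as examples), analyzes the implications for sociolinguistic fairness and the digital language divide, and proposes technical improvements and policy recommendations.

Abstract translation

The design of large language models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic research further argues that these technologies not only depend on language standardization processes historically grounded in European nationalist and colonial projects, but also reinforce an epistemology that treats language as "monolithic, monolingual, syntactically standardized systems of meaning". Building on earlier work at the intersection of technology and language policy, and drawing on our expertise in critical sociolinguistics and computational linguistics, we examine these arguments. We take two complexes of non-standard varieties from our respective research areas as starting points, namely South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, and the varieties of Kurdish, for an interdisciplinary exploration of the intersection of generative AI with linguistic variation and standardization. We discuss, from a technical perspective, how large language models can be made to handle non-standard language, and we also examine whether, when, or how doing so can contribute to "democratic and decolonial digital and machine learning strategies", a discussion with direct policy implications.

Abstract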

The design of Large Language Models and generative artificial intelligence has been shown to be “unfair” to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as “monolithic, monolingual, syntactically standardized systems of meaning”. In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires–South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish–as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to “democratic and decolonial digital and machine learning strategies”, which has direct policy implications.

Keywords: Large Language Models, Generative AI, Non-standard Language, Sociolinguistics, Digital Language Divide, Language Standardization, Computational Linguistics, Language Policy

134. ❌ Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis

Authors: Yijin Wang, Fandi Sun Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28205v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

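Scoring rationale: This paper focuses on the fine-grained sentiment analysis (ABSA) task and proposes a new framework, based on complex projection and an anti-collision mask, to address representation entanglement and false-negative collisions. Its content mainly concerns sentiment analysis, representation learning, contrastive learning, and geometric analysis within natural language processing, but it does not involve large models, innovations in deep learning principles, or applications of large models in other domains. All of the keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, whereas this paper improves a specific method for a traditional NLP task and is unrelated to these keywords.

!!! tip deepseek-chat TL;DR

Targeting representation entanglement and false-negative collisions in aspect-based sentiment analysis, the paper proposes a new framework with a zero-initialized residual complex projection and an anti-collision masked angle loss, achieving fine-grained sentiment disentanglement and a Macro-F1 score of 0.8851 in experiments.

Abstract translation

A core challenge in Aspect-Based Sentiment Analysis (ABSA) is representation entanglement, in which aspect semantics and sentiment polarity are often conflated in real-valued embedding spaces. In addition, standard contrastive learning suffers from false-negative collisions, which severely degrade performance on high-frequency aspects. Inspired by quantum projection and entanglement theory, this paper proposes a novel framework built around a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. The framework projects textual features into a complex semantic space, systematically using the phase to separate sentiment polarities while letting the amplitude encode the semantic intensity and lexical richness of subjective descriptions. To address the collision bottleneck, we introduce an anti-collision mask that preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by more than 50%. Experiments show that the framework achieves a state-of-the-art Macro-F1 score of 0.8851. Further geometric analysis reveals that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, which shows that the unconstrained-amplitude, phase-driven objective is essential for robust fine-grained sentiment disentanglement.

Abstract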

Aspect-Based Sentiment Analysis (ABSA) is fundamentally challenged by representation entanglement, where aspect semantics and sentiment polarities are often conflated in real-valued embedding spaces. Furthermore, standard contrastive learning suffers from false-negative collisions, severely degrading performance on high-frequency aspects. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss, inspired by quantum projection and entanglement ideas. Our approach projects textual features into a complex semantic space, systematically utilizing the phase to disentangle sentiment polarities while allowing the amplitude to encode the semantic intensity and lexical richness of subjective descriptions. To tackle the collision bottleneck, we introduce an anti-collision mask that elegantly preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by over 50%. Experimental results demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses further reveal that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, proving that our unconstrained-amplitude and phase-driven objective is crucial for robust, fine-grained sentiment disentanglement.
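
The abstract does not give the projection's formulation, so the sketch below is only a generic illustration of the ingredients it names, not the paper's ZRCP: real features are kept as a residual real part, an imaginary part comes from a weight matrix that starts at zero, and each complex coordinate is then read as an amplitude and a phase. The function name `complex_project`, the weights, and the feature values are all hypothetical.

```python
import cmath


def complex_project(features, imag_weights):
    """Map real features x to z_k = x_k + i * sum_j W[k][j] * x_j.

    The real part is an unchanged (residual) copy of the input; the
    imaginary part is a learned linear map, assumed to start at zero.
    """
    projected = []
    for k, x_k in enumerate(features):
        imag = sum(w * x for w, x in zip(imag_weights[k], features))
        projected.append(complex(x_k, imag))
    return projected


features = [0.8, -0.3, 0.5]  # invented real-valued text features

# With zero-initialized weights the projection reproduces the input exactly.
zero_weights = [[0.0] * 3 for _ in range(3)]
z_init = complex_project(features, zero_weights)

# After (hypothetical) training the imaginary branch is non-zero,
# so each coordinate gains a phase away from the real axis.
trained_weights = [[0.0, 0.5, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, -1.0]]
z_trained = complex_project(features, trained_weights)

for z in z_trained:
    print(round(abs(z), 3), round(cmath.phase(z), 3))  # amplitude, phase
```

Starting the imaginary branch at zero means the complex model initially behaves like the real-valued one it extends, and phase information is introduced only as training moves the weights.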

Keywords: Aspect-Based Sentiment Analysis, representation entanglement, complex semantic space, Zero-Initialized Residual Complex Projection, Anti-collision Masked Angle Loss, sentiment disentanglement, contrastive learning, geometric analysis

135. ❌ DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis

Authors: Hua Li, Yingying Li, Xiaobin Feng, Xinyi Fu, Lifeng Dong, Qingfeng Yang, Yanzhe Chen, Xiaoju Feng, Zhidong Cao, Jianbin Guo, Yanru Du Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

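Scoring rationale: The core of the paper is an LLM-based diagnostic framework for spleen-stomach disorders in integrative Chinese and Western medicine. Highly relevant keywords include LLMs (the core model), SFT and DPO (the two-stage training method), and AI for Science (a medical application). The data-quality challenge relates to Scaling Laws (5), and diagnostic reasoning touches on CoT and System 2 Thinking (5). Other techniques such as MoE, quantization, and RAG are not mentioned (0).

!!! tip deepseek-chat TL;DR

Addressing three challenges in integrative Chinese and Western medicine diagnosis of spleen-stomach disorders (a lack of high-quality data, insufficient model reasoning ability, and missing evaluation standards), the study proposes the LLM-based DongYuan framework; its diagnostic model, developed through two-stage training (SFT + DPO), significantly outperforms 12 mainstream baselines on a self-built benchmark.

Abstract translation

The clinical burden of spleen-stomach disorders is heavy. Although large language models offer new potential for medical applications, they face three major challenges in integrative Chinese and Western medicine: a lack of high-quality data, the absence of models that can effectively integrate the reasoning logic of traditional Chinese medicine syndrome differentiation with that of Western medical disease diagnosis, and the lack of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, a diagnostic framework for spleen-stomach disorders in integrative Chinese and Western medicine. Specifically, we build three integrative datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) to fill the gap in high-quality data for spleen-stomach disorders. We then develop SSDF-Core, a core diagnostic large language model that acquires robust integrative reasoning ability through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), complemented by SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. In addition, we establish SSDF-Bench, a comprehensive evaluation benchmark focused on integrative diagnosis of spleen-stomach disorders. Experimental results show that SSDF-Core significantly outperforms 12 mainstream baseline models on SSDF-Bench. DongYuan lays a solid methodological foundation for the future development of intelligent integrative diagnostic systems and provides a practical technical reference.

Abstract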

The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
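
The abstract names DPO as the second training stage without giving hyperparameters. As a reminder of what that stage optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The value `beta=0.1` and the log-probabilities in the usage lines are illustrative assumptions, not values from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a summed sequence log-probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); avoid overflow for very negative x.
    if logits < -30.0:
        return -logits
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference signal yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy already favors the chosen response relative to the reference.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

When the policy matches the reference the loss is log 2, and it falls as the policy raises the chosen response's likelihood relative to the rejected one, which is the direction the preference data pushes the model.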

Keywords: Large Language Models, Integrative Chinese and Western Medicine, Spleen-Stomach Disorders, Supervised Fine-tuning, Direct Preference Optimization, Medical Diagnosis, AI for Science, Evaluation Benchmark

136. ❌ Who Wrote the Book? Detecting and Attributing LLM Ghostwriters

Authors: Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

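Scoring rationale: The paper focuses on detecting and attributing LLM authorship; its core contributions are the GhostWriteBench dataset and the TRACE fingerprinting method. It is highly relevant (10) only to 'Large Language Models'. The other keywords concern model architecture, training methods, inference optimization, application domains, and so on, none of which are mentioned in the abstract, so they are rated 0.

!!! tip deepseek-chat TL;DR

The paper introduces the GhostWriteBench dataset and the TRACE fingerprinting method for detecting and attributing the LLM authors of long-form texts, achieving state-of-the-art performance in OOD settings and limited-data scenarios.

Abstract translation

This paper introduces GhostWriteBench, a dataset for large language model authorship attribution. It contains long-form texts (more than 50K words per book) generated by frontier LLMs and is designed to test generalization across several out-of-distribution dimensions, including text domain and unseen model authors. We also propose TRACE, an interpretable and lightweight fingerprinting method that works for both open-source and closed-source models. The method builds a fingerprint by capturing token-level transition patterns (such as word rank) estimated by another lightweight language model. Experiments on GhostWriteBench show that TRACE achieves state-of-the-art performance, remains robust in out-of-distribution settings, and performs well with limited training data.

Abstract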

In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE – a novel fingerprinting method that is interpretable and lightweight – that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
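
The abstract describes TRACE only at a high level, so the following is a toy illustration of the general idea of a rank-based transition fingerprint, not the authors' method. A tiny bigram counter plays the role of the "lightweight language model"; each token is replaced by the rank the model gives it, ranks are bucketed, and the fingerprint is the frequency of bucket-to-bucket transitions. The bucket boundaries and all function names are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-token frequencies: a stand-in for a small language model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def token_ranks(tokens, bigram, unseen_rank=99):
    """Rank of each observed token among the model's predictions (1 = top)."""
    ranks = []
    for prev, nxt in zip(tokens, tokens[1:]):
        ordered = [w for w, _ in bigram.get(prev, Counter()).most_common()]
        ranks.append(ordered.index(nxt) + 1 if nxt in ordered else unseen_rank)
    return ranks


def bucket(rank):
    """Coarse rank buckets; the cut-offs are arbitrary choices for the sketch."""
    if rank == 1:
        return "top1"
    return "top5" if rank <= 5 else "tail"


def fingerprint(tokens, bigram):
    """Normalized counts of transitions between consecutive rank buckets."""
    buckets = [bucket(r) for r in token_ranks(tokens, bigram)]
    transitions = Counter(zip(buckets, buckets[1:]))
    total = sum(transitions.values()) or 1
    return {pair: n / total for pair, n in transitions.items()}


reference = "the cat sat on the mat and the cat ate".split()
model = train_bigram(reference)
print(fingerprint("the cat sat on the mat".split(), model))
```

Fingerprints from texts by different authors could then be compared or fed to a simple classifier; because the features are rank transitions rather than raw probabilities, they stay readable, which fits the interpretability goal the abstract mentions.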

Keywords: LLM authorship attribution, GhostWriteBench, fingerprinting method, TRACE, out-of-distribution generalization, token-level transition patterns, frontier LLMs, long-form texts

137. ❌ From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?

Authors: Shadman Sakib, Oishy Fatema Akhand, Tasnia Tasneem, Shohel Ahmed Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

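Scoring rationale: The paper's core topic is the application of LLMs in software engineering, specifically evaluating how well LLMs (GPT-3.5 Turbo, Gemini 2.0 Flash, Mistral 7B Instruct) turn app store reviews into user stories, so it is highly relevant (10) to 'Large Language Models'. The paper uses zero-shot, one-shot, and two-shot prompting, which falls under 'In-context Learning', so that keyword receives 5. The other keywords concern model architecture, training techniques, inference optimization, alignment methods, specific application domains, and so on, none of which the paper addresses, so they are rated 0.

!!! tip deepseek-chat TL;DR

The study evaluates the ability of large language models (LLMs) to turn messy app store reviews into usable user stories, finding that with few-shot prompting LLMs generate fluent, well-formatted user stories but still fall short in producing independent and unique ones.

Abstract translation

App store reviews provide a continuous stream of real user feedback that can help improve software requirements. However, these reviews are usually messy, informal, and hard to analyze manually at scale. Although automated techniques exist, many perform poorly when replicated and often fail to produce clear user stories that can go straight into the backlog of an agile project.
In this study, we evaluate the ability of large language models (such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct) to generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of more than 1,000 health app reviews, we test zero-shot, one-shot, and two-shot prompting.
We assess the overall quality of the generated user stories through human evaluation (based on the RUST framework) and a RoBERTa classifier fine-tuned on UStAI. The results show that large language models can match or even exceed human performance in writing fluent, well-formatted user stories, especially with few-shot prompting. However, they still struggle to produce independent and unique user stories, which are essential for building a solid agile backlog.
Overall, our findings show that large language models can reliably turn unstructured app reviews into actionable software requirements, giving developers clear guidance for turning user feedback into meaningful improvements.

Abstract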

App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
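
The zero-shot, one-shot, and two-shot conditions mentioned in the abstract differ only in how many worked examples precede the review to be converted. The sketch below shows one way such prompts could be assembled; the instruction wording, the two demonstration pairs, and the `build_prompt` helper are all hypothetical and are not taken from the paper or from the Mini-BAR dataset.

```python
INSTRUCTION = (
    "Rewrite the app review as a user story in the form: "
    "'As a <role>, I want <goal> so that <benefit>.'"
)

# Hypothetical demonstrations written for this sketch.
EXAMPLES = [
    ("App keeps logging me out every day.",
     "As a user, I want to stay signed in so that I do not have to log in every day."),
    ("Wish I could export my step history.",
     "As a user, I want to export my step history so that I can keep my own records."),
]


def build_prompt(review, shots=0):
    """Build a prompt with `shots` demonstrations (0, 1, or 2) before the query."""
    if not 0 <= shots <= len(EXAMPLES):
        raise ValueError(f"shots must be between 0 and {len(EXAMPLES)}")
    parts = [INSTRUCTION]
    for example_review, example_story in EXAMPLES[:shots]:
        parts.append(f"Review: {example_review}\nUser story: {example_story}")
    # The query is left open-ended so the model completes the user story.
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)


print(build_prompt("The heart rate graph is too small to read.", shots=2))
```

Keeping the instruction and query format fixed across conditions means any difference in output quality can be attributed to the number of demonstrations rather than to prompt wording.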

Keywords: Large Language Models, LLMs, user stories, app reviews, software requirements, prompting methods, zero-shot, few-shot
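
The zero-shot, one-shot, and two-shot settings mentioned in the abstract differ only in how many worked examples precede the target review. The sketch below illustrates that idea under stated assumptions: the instruction text, the `build_prompt` helper, and the example pairs are all hypothetical and are not taken from the paper or from the Mini-BAR dataset.

```python
from typing import List, Tuple

# Hypothetical instruction text; the paper's actual prompts are not given in the abstract.
INSTRUCTION = (
    "Rewrite the app review as a user story: "
    "As a <role>, I want <goal> so that <benefit>."
)


def build_prompt(review: str, examples: List[Tuple[str, str]], k: int) -> str:
    """Build a k-shot prompt: k worked (review, story) pairs, then the target review."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of available examples")
    parts = [INSTRUCTION]
    for ex_review, ex_story in examples[:k]:
        parts.append(f"Review: {ex_review}\nUser story: {ex_story}")
    # The target review comes last, with the story left for the model to complete.
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)


# Made-up examples for illustration only (not from the Mini-BAR dataset).
EXAMPLES = [
    ("App crashes when I log my run.",
     "As a runner, I want run logging to be stable so that I do not lose my workout data."),
    ("Wish it had dark mode.",
     "As a user, I want a dark mode so that the app is easier on my eyes at night."),
]

zero_shot = build_prompt("Sync with my watch keeps failing.", EXAMPLES, k=0)
two_shot = build_prompt("Sync with my watch keeps failing.", EXAMPLES, k=2)
```

Passing `k=1` would give the one-shot variant; the resulting string would then be sent to whichever model is being evaluated.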

138. ❌ Transfer Learning for an Endangered Slavic Variety: Dependency Parsing in Pomak Across Contact-Shaped Dialects

Authors: Sercan Karakaş Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

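Scoring rationale: The paper studies dependency parsing for Pomak, an endangered Slavic language, which is a domain-specific NLP application. It centers on transfer learning and fine-tuning, so it has some relevance to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points): the paper explicitly describes training a parser on an existing treebank and then fine-tuning it. The other keywords all concern large models, deep-learning techniques, AI-for-science applications, and similar frontier topics, whereas this paper applies conventional NLP methods to an endangered language and does not touch LLMs, MoE, quantization, inference acceleration, alignment, RAG, or other modern large-model techniques, so they score 0.

!!! tip deepseek-chat TL;DR

This paper studies transfer learning for dependency parsing between the Greek and Turkish varieties of Pomak, an endangered Slavic language. It finds that, despite dialectal differences, targeted fine-tuning and cross-variety transfer learning substantially improve parsing accuracy.

Abstract translation (Chinese)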

本文为波马克语（一种濒危的东部南斯拉夫语言）的依存句法分析提供了新的资源与基线模型。该语言存在显著的方言差异且缺乏广泛采用的标准形式。我们聚焦于土耳其（乌尊克普吕）地区使用的方言变体，并探究基于现有波马克语通用依存树库（该树库主要源自希腊地区使用的方言变体）训练的依存句法分析器在不同方言间的迁移效果。研究分为两个实验阶段：首先，我们在希腊方言变体的通用依存数据上训练句法分析器，并评估其对土耳其方言变体波马克语的零样本迁移能力，量化音系及形态句法差异产生的影响；其次，我们引入一个包含650个句子的全新人工标注土耳其方言变体波马克语语料库，证明尽管数据规模有限，但针对性的微调能显著提升分析准确率。通过结合两种方言的跨变体迁移学习，模型性能得到进一步提升。

Abstract

This paper presents new resources and baselines for Dependency Parsing in Pomak, an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard. We focus on the variety spoken in Turkey (Uzunköprü) and ask how well a dependency parser trained on the existing Pomak Universal Dependencies treebank, which was built primarily from the variety that is spoken in Greece, transfers across dialects. We run two experimental phases. First, we train a parser on the Greek-variety UD data and evaluate zero-shot transfer to Turkish-variety Pomak, quantifying the impact of phonological and morphosyntactic differences. Second, we introduce a new manually annotated Turkish-variety Pomak corpus of 650 sentences and show that, despite its small size, targeted fine-tuning substantially improves accuracy; performance is further boosted by cross-variety transfer learning that combines the two dialects.

Keywords: Dependency Parsing, Pomak, Endangered Language, Transfer Learning, Dialect Variation, Fine-tuning, Universal Dependencies, Cross-variety Learning

139. ❌ Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation

Authors: Xinran Zhang Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28005v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

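Scoring rationale: The paper's core topic is how LLMs perform as evaluators (LLM judges) in reference-grounded QA evaluation, a direct application of LLMs, so it is highly relevant to 'Large Language Models' (10 points). It classifies how completely a candidate answer is supported by a reference (fully supported, partially supported, or unsupported), which involves factuality verification and gives it some relevance to 'Hallucination Mitigation' (5 points). The paper does not address the techniques or application areas behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, so those score 0.

!!! tip deepseek-chat TL;DR

This paper compares two LLM-judge designs for reference-grounded QA evaluation: atomic decomposition (splitting an answer into claims and verifying each one) and holistic judging. The holistic judge matches or exceeds the atomic judge on two of three benchmarks, and its advantage is concentrated in detecting partially supported cases.

Abstract translation (Chinese)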

原子分解——在验证前将候选答案拆分为多个主张，再分别与参考依据核对——是基于大语言模型的参考型评估器中广泛采用的设计。然而，原子式提示通常更丰富且更长，这使得其优势究竟源于分解过程还是更丰富的提示本身尚不明确。本研究针对基准式、对完整性敏感的参考支持分类任务（即根据提供的参考依据，将候选答案分类为完全支持、部分支持或不支持）对此进行探究。我们比较了具备自我分解能力的原子式评估器（单提示分解-验证）与经过提示控制的整体式评估器，两者输入相同且使用相似细化的评估准则。我们在TruthfulQA、ASQA和QAMPARI三个数据集上各选取200个源示例，采用四种模型系列，通过源层级配对检验、聚类自助法，并对每种设计族内三种预先固定的提示变体结果进行聚合分析。研究发现，在三个基准中的两个上，整体式评估器表现等同或优于原子式评估器：ASQA和QAMPARI在所有四种模型系列中均倾向于整体式评估（其中三个系列的结果具有统计可靠性），而TruthfulQA显示原子式评估略有优势。整体式评估的优势主要集中在部分支持案例——即不完整性检测方面。针对人工标注的敏感性检验证实，在基准完整性标准和人工事实正确性标准下，该评估器排序结果一致。我们的发现特定于采用自我分解单提示模式、在三个问答式基准（各含200个源示例）上的实验；多阶段原子式流程及非问答任务仍有待检验。在所考察的干扰因素中，参考依据质量下降对两种评估器家族均造成了最大的准确率下降。

Abstract

Atomic decomposition – breaking a candidate answer into claims before verifying each against a reference – is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases – incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.

Keywords: LLM judges, atomic decomposition, reference-grounded evaluation, QA evaluation, holistic judge, completeness-sensitive classification, TruthfulQA, ASQA

140. ❌ CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Authors: Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27982v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

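Scoring rationale: The paper focuses on hallucination in vision-language models (VLMs), specifically commonsense-driven hallucination, where the model follows commonsense rather than the visual evidence when the two conflict. This is highly relevant to 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points), since evaluating and diagnosing hallucination is the paper's core. It has some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points), because the benchmark diagnoses model behavior and helps explain model decisions. The other keywords mainly concern LLM-specific techniques, training methods, optimization, reasoning methods, agent systems, and compression or acceleration; this paper studies VLMs and does not examine those techniques in depth, so they all score 0.

!!! tip deepseek-chat TL;DR

This paper studies commonsense-driven hallucination in vision-language models, which arises when visual evidence conflicts with commonsense, and introduces the CDH-Bench benchmark to evaluate visual fidelity under such conflicts. It finds that even frontier models tend to follow their priors and override the visual evidence.

Abstract translation (Chinese)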

视觉语言模型（Vision-Language Models, VLMs）在许多基准测试中表现出色，但一个基本的可靠性问题仍未得到充分探究：当视觉证据与常识相冲突时，模型是遵循所呈现的内容，还是遵从常识的暗示？在此情境下，一个典型的失败模式是模型会忽略视觉证据，输出符合常识的替代答案。我们将这种现象称为常识驱动幻觉（commonsense-driven hallucination, CDH）。为评估此现象，我们提出了CDH-Bench，这是一个专门设计用于制造显式视觉证据-常识冲突的基准测试。CDH-Bench涵盖三个维度：计数异常、关系异常和属性异常。我们在二元问答和多项选择问答两种设置下评估前沿的视觉语言模型，并报告包括反事实准确率、常识准确率、反事实准确率下降、常识崩溃率和相对先验依赖性在内的多项指标。结果表明，在视觉证据与常识冲突的情况下，即使是强大的模型仍易受先验驱动归一化的影响。CDH-Bench为视觉证据-常识冲突下的视觉保真度提供了一个受控的诊断工具。

Abstract

Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence–commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence–commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence–commonsense conflict.

Keywords: Vision-Language Models, Commonsense-Driven Hallucination, Visual Evidence-Commonsense Conflict, Benchmark Evaluation, Visual Fidelity, Counterfactual Accuracy, Prior-Driven Normalization, CDH-Bench

141. ❌ On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR

Authors: Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

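Scoring rationale: The paper studies layer pruning of the Whisper encoder and LoRA fine-tuning in the SLAM-ASR system. Its core involves applying a large model (Whisper) to speech recognition, LoRA parameter-efficient fine-tuning, and model compression (pruning). It is highly relevant to 'PEFT/LoRA' (10 points), since LoRA is the main experimental method; relevant to 'Large Language Models' (8 points), since Whisper is a large-scale pretrained model; relevant to 'Post-training/SFT' (8 points), since it involves fine-tuning; somewhat relevant to 'Pre-training' (5 points), since it builds on pretrained models; and somewhat relevant to 'Quantization/Model Compression' (5 points), since pruning is a form of model compression. Other keywords such as MoE, SLMs, and Scaling Laws are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies the effect of layer pruning on the Whisper speech encoder in SLAM-ASR. It finds that combining pruning with LoRA fine-tuning not only recovers the lost performance but also outperforms the unpruned baseline across several languages and model variants, while reducing the total parameter count.

Abstract translation (Chinese)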

近年来，在大规模预训练模型及SLAM-ASR等端到端架构的推动下，自动语音识别（Automatic Speech Recognition, ASR）技术发展迅速。SLAM-ASR系统的关键组件之一是Whisper语音编码器，它能够提供鲁棒的声学表征。尽管已有研究探索对完整Whisper编码器-解码器架构进行模型剪枝，但其在SLAM-ASR框架下的影响尚未得到充分探究。本研究分析了当Whisper编码器作为SLAM-ASR的声学骨干网络时，层剪枝所带来的影响。我们进一步探究了基于LoRA（Low-Rank Adaptation）的微调方法能在多大程度上恢复因剪枝导致的性能下降。实验覆盖三种Whisper变体（Small、Medium、Large-v2）、代表不同资源水平的三种语言（丹麦语、荷兰语、英语）以及超过200次训练运行。结果表明，剪除两个编码器层仅导致2-4%的词错误率（WER）上升，而将此种剪枝与LoRA适配结合后，在总参数量减少7-14%的同时，性能持续优于未剪枝的基线模型。此外，我们的误差分析显示，LoRA主要通过语言模型的语言先验进行补偿，使荷兰语和英语的总词错误减少11-21%，其中替换错误和删除错误的降低最为显著。然而，对于低资源语言丹麦语，错误减少幅度较小（4-7%），且LoRA引入了更多的插入错误，这表明补偿效果取决于大语言模型（LLM）已有的语言熟练度及可用的训练数据。

Abstract

Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model’s linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM’s pre-existing language proficiency and available training data.

Keywords: Automatic Speech Recognition, SLAM-ASR, Whisper encoder, Layer pruning, LoRA fine-tuning, Parameter-efficient fine-tuning, Model compression, Word Error Rate
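
The abstract reports results as word error rate (WER) and breaks errors down into substitutions, deletions, and insertions. As a reminder of how those quantities relate, here is a minimal sketch of the standard definition, WER = (S + D + I) / N with N the number of reference words, computed by word-level edit distance. The function name `wer_breakdown` is hypothetical, and this is not the authors' evaluation code; their text normalization and tooling are not described in the abstract.

```python
from typing import List, Tuple


def wer_breakdown(reference: List[str], hypothesis: List[str]) -> Tuple[float, int, int, int]:
    """Return (WER, substitutions, deletions, insertions) from a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = (total_edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                diag = dp[i - 1][j - 1]  # exact match, no cost
            else:
                c, s, d, k = dp[i - 1][j - 1]
                diag = (c + 1, s + 1, d, k)  # substitution
            c, s, d, k = dp[i - 1][j]
            delete = (c + 1, s, d + 1, k)  # reference word missing from hypothesis
            c, s, d, k = dp[i][j - 1]
            insert = (c + 1, s, d, k + 1)  # extra word in hypothesis
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    total, subs, dels, ins = dp[n][m]
    return total / max(n, 1), subs, dels, ins


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
wer, subs, dels, ins = wer_breakdown(ref, hyp)  # one substitution, one deletion
```

When several alignments share the same minimum cost, the split between error types can vary with the tie-breaking rule, so per-type counts from different tools may differ slightly even when the WER agrees.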

142. ❌ Efficient Inference of Large Vision Language Models

Authors: Surendra Pathak Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27960v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

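Scoring rationale: The paper is a survey of inference acceleration for large vision-language models (LVLMs), which falls under large-model techniques. It is strongly relevant to 'Large Language Models' (8 points), since LVLMs extend LLMs with vision. It is highly relevant to 'Speculative Decoding OR Inference Acceleration' (10 points), since inference acceleration is the survey's core topic. It has some relevance to 'KV Cache Compression OR Linear Attention OR FlashAttention' (5 points), since it covers attention optimization, and to 'Quantization OR Model Compression OR Low-bit Weights' (5 points), since it covers model compression. The other keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

This paper systematically surveys optimization techniques for accelerating inference in large vision-language models, covering visual token compression, memory management, efficient architecture design, and advanced decoding strategies. It also identifies the limitations of current methods and directions for future research.

Abstract translation (Chinese)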

尽管大规模视觉语言模型（LVLMs）已展现出令人瞩目的多模态推理能力，但其扩展性与部署受限于庞大的计算需求。特别是高分辨率输入数据产生的海量视觉标记，由于注意力机制的二次复杂度而进一步加剧了这一问题。为应对这些挑战，研究界已开发出多种优化框架。本文全面综述了当前加速LVLM推理的前沿技术，提出一个系统化分类法，将现有优化框架归纳为四个主要维度：视觉标记压缩、内存管理与服务、高效架构设计以及先进解码策略。此外，我们批判性地审视了现有方法的局限性，并指出了关键性的开放问题，以期为高效多模态系统的未来研究方向提供启示。

Abstract

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.

Keywords: Large Vision Language Models, LVLMs, inference acceleration, visual token compression, efficient architectural design, multimodal systems, attention mechanisms, decoding strategies
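
The abstract's point about high-resolution inputs and quadratic attention cost can be made concrete with a little arithmetic. The sketch below assumes a ViT-style encoder that cuts the image into non-overlapping square patches and emits one token per patch; the patch size of 14 and the two resolutions are illustrative assumptions, not figures from the survey.

```python
def num_patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Tokens produced when an image is split into non-overlapping patch x patch tiles."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


low = num_patch_tokens(336, 336)    # 24 x 24 patches
high = num_patch_tokens(672, 672)   # 48 x 48 patches
growth = attention_pairs(high) // attention_pairs(low)
```

Under these assumptions, doubling each image side quadruples the token count and multiplies the number of attention pairs by 16, which is the pressure that visual token compression aims to relieve.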

143. ❌ EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles

Authors: Zhuoshang Wang, Yubing Ren, Guoyu Zhao, Xiaowei Zhu, Hao Li, Yanan Cao Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27949v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

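Scoring rationale: The paper focuses on reliable detection of Chinese LLM-generated text. Its core is an LLM application (detecting LLM-generated content), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the techniques, methods, or applications behind the other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, or AI for science, so they all score 0.

!!! tip deepseek-chat TL;DR

This study addresses the out-of-domain inputs and adversarial samples that Chinese LLM-generated text detection faces in real-world settings. It proposes EnsemJudge, a robust framework that combines tailored strategies with ensemble voting, which outperformed all baselines and took first place in the NLPCC2025 shared task.

Abstract translation (Chinese)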

大语言模型（Large Language Models, LLMs）凭借其强大的文本生成能力，被广泛应用于各个领域。尽管LLM生成的文本常与人类撰写的文本相似，但其滥用可能带来重大的社会风险。检测此类文本是缓解LLM滥用的关键技术，许多检测方法已在不同数据集上展现出良好的效果。然而，现实场景中常涉及域外输入或对抗性样本，这会在不同程度上影响检测方法的性能。此外，现有研究大多集中于英文文本，针对中文文本检测的工作相对有限。在本研究中，我们提出了EnsemJudge，一个通过结合定制化策略与集成投票机制来检测中文LLM生成文本的鲁棒性框架。我们在NLPCC2025共享任务1提供的精心构建的中文数据集上对系统进行了训练与评估。我们的方法超越了所有基线方法，并在该任务中取得了第一名，证明了其在中文LLM生成文本检测中的有效性与可靠性。我们的代码公开于https://github.com/johnsonwangzs/MGT-Mini。

Abstract

Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at https://github.com/johnsonwangzs/MGT-Mini.

Keywords: LLM-generated text detection, Chinese text detection, ensemble voting, robust framework, out-of-domain inputs, adversarial samples, NLPCC2025
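
The abstract mentions ensemble voting without specifying the scheme. As a generic illustration only, the sketch below shows plain majority voting over binary detector outputs. The label convention (1 for LLM-generated, 0 for human-written), the tie-breaking rule, and the `majority_vote` helper are all assumptions and should not be read as the EnsemJudge method.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[int], tie_label: int = 1) -> int:
    """Return the most common label; fall back to tie_label when the top two counts tie."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label  # assumed policy: flag as LLM-generated on a tie
    return ranked[0][0]


# Three hypothetical detectors voting on one text (1 = LLM-generated, 0 = human-written).
decision = majority_vote([1, 0, 1])
```

Using an odd number of detectors avoids ties for binary labels; weighted or confidence-based voting would be natural alternatives.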

144. ❌ Top-down string-to-dependency Neural Machine Translation

Authors: Shuhei Kondo, Katsuhito Sudoh, Yuji Matsumoto Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

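Scoring rationale: The paper studies syntax-augmented decoding for neural machine translation (NMT) and proposes a novel top-down decoder based on dependency trees to improve generalization to long inputs. It focuses on the conventional NMT architecture (encoder-decoder with attention) and on syntax integration, and does not involve LLMs, new deep-learning techniques, or applications of large models to other domains. All scoring keywords concern large-model techniques, training methods, inference optimization, alignment, agent systems, and similar frontier topics, whereas this paper belongs to earlier-style NMT research and is unrelated to them. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a novel top-down string-to-dependency decoder for neural machine translation. By generating a target-language dependency tree, it generalizes better to long input sentences that were not seen in the training data.

Abstract translation (Chinese)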

现代神经机器翻译模型大多基于带有注意力机制的编码器-解码器框架。尽管这些模型在标准数据集上表现良好，但在翻译训练中罕见或未出现的长输入时仍存在困难。融入目标语言句法是解决此类长度相关问题的一种途径。我们提出了一种新颖的句法解码器，该解码器以自上而下、从左到右的顺序生成目标语言的依存树。实验表明，在翻译训练数据中未出现的长输入时，所提出的自上而下的字符串到树解码方法比传统的序列到序列解码具有更好的泛化能力。

Abstract

Most of modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can have trouble in translation of long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.

Keywords: neural machine translation, syntactic decoder, dependency tree, top-down decoding, long input translation, generalization, encoder-decoder framework, attention mechanism

145. ❌ HumMusQA: A Human-written Music Understanding QA Benchmark Dataset

Authors: Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, Dmitry Bogdanov Venue/Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

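Scoring rationale: The paper focuses on evaluating music understanding, an application of AI to a specific domain (music). Relevance to 'Large Language Models' is 8 points, since the paper evaluates Large Audio-Language Models (LALMs), which extend LLMs to audio. Relevance to 'AI for Science' is 8 points, since music understanding is treated as an application of AI at the intersection of art and science. Relevance to 'Scaling Laws AND Data Quality' is 5 points, since the paper stresses the importance of a high-quality, manually curated dataset, which relates to data quality but is not the core topic. The other keywords (such as MoE, SFT, and RAG) are unrelated to the paper's music-evaluation focus and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of a reliable benchmark for evaluating music understanding in Large Audio-Language Models (LALMs), this paper introduces HumMusQA, a dataset of 320 hand-written questions curated and validated by experts with musical training. It uses the dataset to benchmark six state-of-the-art LALMs and to test their robustness to uni-modal shortcuts.

Abstract translation (Chinese)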

大型音频-语言模型（LALMs）的音乐理解能力评估需要一个严格定义的基准，以真正测试模型是否能够感知和解读音乐，而当前的数据方法往往无法满足这一标准。本文提出了一种精心构建的音乐评估方法，引入了一个包含320道人工编写问题的新数据集，这些问题由受过音乐训练的专业人士策划和验证，并论证了这种聚焦式人工策划在探究复杂音频理解方面更具优势。为展示该数据集的应用，我们对六种前沿的LALMs进行了基准测试，并额外检验了它们对单模态捷径的鲁棒性。

Abstract

The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.

Keywords: music understanding, Large Audio-Language Models, benchmark dataset, human-written questions, audio comprehension, model evaluation, expert validation, uni-modal shortcuts

146. ❌ KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter

Authors: Rauan Akylzhanov Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27859v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

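Scoring rationale: The paper studies how to adapt a large language model (Qwen2.5-7B) to Kazakh and proposes a two-stage fine-tuning method (first train a byte-level adapter, then fine-tune the attention layers), an innovative application of core LLM techniques. Highly relevant keywords: Large Language Models (the object of study), Supervised Fine-tuning (the core method), and Parameter-efficient Fine-tuning (the essence of the adapter approach). Partially relevant keywords: Domain Adaptation (adapting the model to Kazakh) and Context Window Extension (addressing the context shortening caused by tokenization). Other keywords such as MoE, SLMs, alignment, RAG, and inference acceleration are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the tokenizer-induced inefficiency of large language models on Kazakh, the paper proposes a two-stage method (train a byte-level adapter, then fine-tune the attention layers) that aims to match or exceed the original model's accuracy on Kazakh benchmarks. Empirical validation is still ongoing.

Abstract translation (Chinese)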

大型语言模型将哈萨克语文本切分为比同等英语文本更多的词元，因为其分词器是为高资源语言构建的。这种分词器税增加了计算成本，缩短了有效上下文窗口，并削弱了模型对哈萨克语形态学的把握。我们提出完全绕过分词器，通过一个小型适配器输入原始字节，该适配器学习与冻结的Qwen2.5-7B的内部语言进行交互。适配器训练完成后，我们将其冻结，并仅针对哈萨克语文本微调Qwen的注意力层。我们的核心假设是：这种两阶段过程——先教授接口，再调整模型——应在标准哈萨克语基准测试上达到或超越原始Qwen2.5-7B的准确率。本报告描述了ByteKaz架构与训练方案。实证验证正在进行中；此版本为记录目的阐明设计与假设。

Abstract

Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model’s grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process – first teach the interface, then adapt the model – should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses for the record.

Keywords: Large language models, Qwen models, Kazakh language adaptation, Byte-level adapter, Parameter-efficient fine-tuning, Tokenizer tax, Attention layer fine-tuning, Domain adaptation
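The byte-level input described in the abstract above can be made concrete with a small sketch. It only shows what a raw UTF-8 byte sequence looks like for Cyrillic-script Kazakh versus ASCII English; it does not measure tokenizer fragmentation (that would require the Qwen tokenizer) and it is not the ByteKaz code. The helper name `byte_stats` and the sample words are assumptions chosen for illustration.

```python
def byte_stats(text: str) -> tuple[int, int, float]:
    """Return (characters, UTF-8 bytes, bytes per character) for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars


# Illustrative sample words (assumptions, not taken from the paper).
kazakh = "сәлем"   # Kazakh greeting, Cyrillic script
english = "hello"  # ASCII only

for label, word in [("kazakh", kazakh), ("english", english)]:
    chars, n_bytes, ratio = byte_stats(word)
    print(f"{label}: {chars} chars, {n_bytes} bytes, {ratio:.1f} bytes/char")
```

Each Cyrillic letter occupies two bytes in UTF-8, so a byte-level adapter would see roughly twice as many input positions per Kazakh character as per ASCII character.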

147. ❌ What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps

Authors: Dario Paape Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27855v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

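Scoring rationale: The paper uses the Pythia model suite to study polarity illusions in LLMs and focuses on analyzing the behavior of LLMs themselves, so it is highly relevant to 'Large Language Models' (10). It examines how model scale affects the illusions, which relates to 'Scaling Laws' (8). It uses the way LLMs process these sentences to reason about human language processing, which relates to 'Mechanistic Interpretability' (8). Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and application domains, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The study uses the Pythia suite to investigate how two polarity illusions (the NPI illusion and the depth charge illusion) arise in LLMs. The NPI illusion weakens and eventually disappears as model size grows, whereas the depth charge illusion becomes stronger, which has implications for accounts of human sentence processing.

Abstract translation (Chinese)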

本研究借助Pythia规模扩展模型套件（Biderman et al. 2023），探究两种著名的极性错觉，即NPI错觉（NPI illusion）与深水炸弹错觉（depth charge illusion），是否以及如何在大型语言模型（LLMs）中出现。实验表明：随着模型规模增大，NPI错觉逐渐减弱直至消失；而深水炸弹错觉在更大的模型中反而增强。该结果对人类句子加工研究具有启示意义：鉴于LLMs不太可能进行此类推理（尤其是在隐式的下一词元预测层面），解释极性错觉或许无需假设存在将不合语法的句子转化为合语法句子的“理性推断”机制。另一方面，浅层的“足够好”加工和/或规范语法上不合格的结构的部分语法化，都可能在LLMs中发生。本文基于构式语法（construction grammar）的基本原则，提出了一种综合不同理论解释的分析框架。

Abstract

I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume “rational inference” mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, “good enough” processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.

Keywords: LLMs, polarity illusions, NPI illusion, depth charge illusion, model scaling, Pythia, sentence processing, construction grammar

148. ❌ Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Authors: Natapong Nitarach Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

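Scoring rationale: The paper studies inference-time strategy optimization for LLMs on mathematical reasoning tasks. Its core topic is LLM reasoning (Chain of Thought / System 2 Thinking); it does not touch other keywords such as MoE, training methods, compression, or scientific applications.

!!! tip deepseek-chat TL;DR

The paper tests whether assigning different reasoning strategies to different voters can decorrelate LLM errors in mathematical reasoning, and finds that every such intervention fails: model capability matters far more than inference-time optimization.

Abstract translation (Chinese)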

多数投票机制通过整合多个大语言模型的求解尝试来提升数学推理能力，但误差相关性限制了有效样本量。一种自然的解决方案是：为不同投票者分配结构相异的推理策略以降低误差相关性。我们在AIMO 3竞赛中测试了这种“多样化提示混合器”方法：使用3个模型、进行23项以上实验，在单张H100 80GB显卡的5小时限制内求解50道国际数学奥林匹克竞赛级别题目。所有干预措施均告失败。高温采样已足以充分降低误差相关性；较弱的提示策略在降低相关性的同时，对单次尝试准确率的损害更为显著。在跨越17分模型能力差距的测试中，无论采用何种推理时优化策略，模型基础能力始终以数量级优势主导结果。

Abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.

Keywords: LLM, mathematical reasoning, inference-time optimization, diverse prompting, error correlation, model capability, AIMO competition, majority voting
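The two ideas the abstract above relies on, majority voting and the loss of effective sample size under correlated errors, can be sketched in a few lines. This is a minimal illustration and not the paper's code: the equal-correlation model and the textbook formula n / (1 + (n - 1) * rho) are assumptions, as are the function names and the sample votes.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


def effective_sample_size(n: int, rho: float) -> float:
    """Effective number of independent votes among n equally correlated votes.

    Uses the textbook formula n / (1 + (n - 1) * rho), where rho is the
    average pairwise error correlation (an assumed model, not the paper's).
    """
    return n / (1 + (n - 1) * rho)


# Hypothetical final answers from five attempts at one problem.
votes = ["42", "17", "42", "42", "17"]
print(majority_vote(votes))

# Eight voters: independent, moderately correlated, fully correlated.
for rho in (0.0, 0.5, 1.0):
    print(rho, effective_sample_size(8, rho))
```

Under this model, eight fully correlated voters are worth a single vote, which is why decorrelating errors looks attractive on paper even though the paper reports that it did not help in practice.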

149. ❌ ProText: A benchmark dataset for measuring (mis)gendering in long-form texts

Authors: Hadas Kotek, Margit Bowler, Patrick Sonnenberg, Yu’an Yang Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

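Scoring rationale: The paper studies gender bias and misgendering by large language models in text transformation tasks (such as summarization and rewriting), using state-of-the-art LLMs in its experiments. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (8), since it explicitly runs and analyzes LLMs. The other keywords concern model architecture, training methods, inference optimization, and application domains, none of which the paper addresses, so they all score 0.

!!! tip deepseek-chat TL;DR

The study introduces ProText, a dataset for measuring gendering and misgendering by large language models in long-form text transformation tasks. A mini case study reveals systematic gender bias, particularly when inputs lack explicit gender cues or when models default to heteronormative assumptions.

Abstract translation (Chinese)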

我们推出ProText数据集，用于测量风格多样的长篇英语文本中的性别化指代与错误性别指代现象。该数据集涵盖三个维度：主题名词（姓名、职业、头衔、亲属称谓）、主题类别（刻板印象男性化、刻板印象女性化、性别中立/无性别）以及代词类别（男性化、女性化、性别中立、无代词）。该数据集旨在探究最先进的大型语言模型在文本转换任务（如摘要生成与文本重写）中的（错误）性别指代行为，其范围超越了传统的代词消解基准测试，也超越了性别二元框架。我们通过微型案例研究验证了ProText的有效性，结果表明仅使用两种提示词和两种模型，便能获得关于性别偏见、刻板印象、错误性别指代及性别化指代的细致洞察。研究揭示了系统性的性别偏见，尤其当输入文本缺乏明确性别线索或模型默认遵循异性恋规范假设时更为显著。

Abstract

We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validated ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.

Keywords: gendering, misgendering, large language models, gender bias, text transformations, dataset, stereotyping, pronoun resolution

150. ❌ Q-Bridge: Code Translation for Quantum Machine Learning via LLMs

Authors: Runjia Zeng, Priyabrata Senapati, Ruixiang Tang, Dongfang Liu, Qiang Guan Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27836v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

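Scoring rationale: The paper's core is using LLMs to translate code for quantum machine learning, so it is highly relevant to 'Large Language Models' (10). The method involves supervised fine-tuning (SFT) and LoRA parameter-efficient fine-tuning, each scoring 10. The work is an AI for Science application and scores 10. The paper builds the CML-2-QML dataset, which relates to data quality to some degree (5). The fine-tuning process can be viewed as a form of domain adaptation (5). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Q-Bridge, a framework that uses large language models to translate classical machine learning code into executable quantum machine learning code. It builds the CML-2-QML dataset with a self-involving pipeline, fine-tunes the model with supervised LoRA adaptation, and presents itself as the first reproducible framework and dataset for LLM-driven quantum code translation.

Abstract translation (Chinese)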

大型语言模型近期展现出弥合经典机器学习与量子机器学习之间鸿沟的潜力。然而，缺乏标准化、高质量的数据集及稳健的翻译框架限制了该领域的进展。我们提出Q-Bridge，一个由LLM引导的代码翻译框架，能够系统地将经典机器学习实现转换为可执行的量子机器学习变体。该方法基于一个自迭代流程，通过迭代扩展已验证的种子代码库，构建出大规模数据集CML-2-QML，其中整合了可验证与不可验证的代码对。Q-Bridge模型通过监督式LoRA适配进行微调，实现了可扩展且内存高效的训练，能够在不同架构上生成忠实且可解释的量子代码。实证分析证实了从经典机器学习到量子机器学习的直接翻译可行性，并揭示了经典范式与量子范式之间一致的结构对齐。案例研究进一步表明，Q-Bridge既能保持确定性正确性，也能支持创造性的架构探索。本研究首次建立了LLM驱动的量子代码翻译的可复现框架与数据集，为可扩展的量子人工智能发展奠定了基础。

Abstract

Large language models have recently shown potential in bridging the gap between classical machine learning and quantum machine learning. However, the lack of standardized, high-quality datasets and robust translation frameworks limits progress in this domain. We introduce Q-Bridge, an LLM-guided code translation framework that systematically converts CML implementations into executable QML variants. Our approach builds on a self-involving pipeline that iteratively expands a verified seed codebase into a large-scale dataset, CML-2-QML, integrating verifiable and unverifiable code pairs. The Q-Bridge model is fine-tuned using supervised LoRA adaptation for scalable and memory-efficient training, achieving faithful and interpretable quantum code generation across diverse architectures. Empirical analysis confirms the feasibility of direct CML-to-QML translation and reveals consistent structural alignment between classical and quantum paradigms. Case studies further demonstrate that Q-Bridge can maintain deterministic correctness and also enable creative architectural exploration. This work establishes the first reproducible framework and dataset for LLM-driven quantum code translation, offering a foundation for scalable quantum AI development.

Keywords: Large Language Models, Quantum Machine Learning, Code Translation, LoRA, Supervised Fine-tuning, CML-2-QML Dataset, AI for Science, Parameter-efficient Fine-tuning

151. ❌ Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Authors: Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei, Hao Peng, Yue Guo Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

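Scoring rationale: The paper proposes an LLM-based clinical diagnosis framework whose core is counterfactual reasoning combined with a multi-agent system, aimed at improving diagnostic accuracy and interpretability. It is therefore highly relevant (10) to LLMs, multi-step reasoning (Chain of Thought), in-depth reasoning (System 2 Thinking), LLM agents, multi-agent systems, Explainable AI, and AI for Science. The framework also involves Self-Correction and Hallucination Mitigation to improve reliability, each scoring 8. Other keywords such as MoE, quantization, and training methods are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of interpretable reasoning in LLM-based clinical diagnosis systems, the paper proposes a framework built on counterfactual case editing and multi-agent discussion. It consistently improves diagnostic accuracy, with the largest gains on complex and ambiguous cases, and human evaluators judge its reasoning more clinically useful.

Abstract translation (Chinese)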

临床诊断是一个复杂的推理过程，临床医生在此过程中收集证据、形成假设，并对照其他可能的解释进行检验。在医学训练中，这种推理能力通过反事实提问得以显式培养——例如，询问若关键症状缺失或改变，诊断将如何变化——以强化鉴别诊断技能。随着基于大语言模型的系统日益用于诊断支持，确保其建议的可解释性变得至关重要。然而，现有大多数基于大语言模型的诊断智能体仅在固定的临床证据上进行推理，未能显式检验个体发现如何支持或削弱竞争性诊断。在本研究中，我们受临床医生培训启发，提出一种反事实多智能体诊断框架，使假设检验过程显式化并基于证据。该框架引入反事实病例编辑，通过修改临床发现并评估这些变化如何影响竞争性诊断。我们进一步定义了反事实概率差距，该方法通过测量在这些编辑下置信度的变化，量化个体发现对特定诊断的支持强度。这些反事实信号引导多轮专科医生讨论，使智能体能够质疑无依据的假设、优化鉴别诊断，并产生更具可解释性的推理路径。在三个诊断基准测试和七种大语言模型上的实验表明，相较于直接提示和先前的多智能体基线方法，我们的方法持续提升了诊断准确性，在复杂和模糊病例中改善最为显著。人工评估进一步证实，该框架能产生更具临床实用性、更可靠且逻辑连贯的推理。这些结果表明，引入反事实证据验证是构建可靠临床决策支持人工智能系统的重要一步。

Abstract

Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning–e.g., asking how a diagnosis would change if a key symptom were absent or altered–to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.

Keywords: clinical diagnosis, large language models, counterfactual reasoning, multi-agent systems, interpretability, differential diagnosis, evidence verification, AI for healthcare
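The abstract above describes the Counterfactual Probability Gap only as a measure of confidence shifts under counterfactual edits; it gives no formula. The sketch below therefore assumes the simplest reading, the difference between the model's confidence in a diagnosis on the original case and on the edited case. The function names, the findings, and the confidence values are all hypothetical and are not taken from the paper.

```python
def probability_gap(p_original: float, p_edited: float) -> float:
    """Confidence drop for one diagnosis when a finding is removed or altered.

    Positive: the finding supported the diagnosis.
    Negative: the finding argued against it.
    """
    return p_original - p_edited


def rank_findings(
    p_original: float, p_after_edit: dict[str, float]
) -> list[tuple[str, float]]:
    """Sort findings by how strongly each supports the diagnosis."""
    gaps = {f: probability_gap(p_original, p) for f, p in p_after_edit.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical model confidences for a single candidate diagnosis.
p_full_case = 0.80
p_without_finding = {"fever": 0.75, "chest pain": 0.30, "cough": 0.85}

for finding, gap in rank_findings(p_full_case, p_without_finding):
    print(f"{finding}: {gap:+.2f}")
```

In this toy example, removing "chest pain" lowers confidence the most, so it would be the finding a specialist agent examines first when challenging the diagnosis.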

152. ❌ Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science

Authors: Andrei Popescu-Belis Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

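Scoring rationale: The paper discusses the relationship between NLP (especially in the era of large language models) and the understanding of the human language capacity, a discussion of large models from the perspective of cognitive science. The abstract explicitly mentions "large language models" and "current chatbots using artificial neural networks", so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8). The paper takes a philosophical and cognitive-science perspective and does not cover specific technical principles, training methods, optimization techniques, or domain applications, so all other technical keywords (e.g., MoE, SFT, RAG, quantization) and scientific-application keywords (e.g., AI for Science) are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper examines how the evolution of technology from early natural language processing to the era of large language models relates to theories of human language cognition. It concludes that, despite the impressive language abilities of current neural-network chatbots, the development of language technology has not substantially deepened our understanding of how the human mind processes natural language.

Abstract translation (Chinese)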

本文探讨了计算机自然语言处理（Natural Language Processing，NLP）与语言学及认知科学所研究的人类语言能力理解之间的关系。我们梳理了自然语言处理从其起源到大规模语言模型时代的发展历程，并针对其主要范式，分别指出了其与人类语言能力理论之间的若干相似性和差异性。我们的结论是，尽管当前基于人工神经网络的聊天机器人已展现出令人瞩目的语言能力，但语言技术的演进并未显著深化我们对于人类心智如何处理自然语言的理解。

Abstract

In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science. We outline the evolution of NLP from its beginnings until the age of large language models, and highlight for each of its main paradigms some similarities and differences with theories of the human language capacity. We conclude that the evolution of language technology has not substantially deepened our understanding of how human minds process natural language, despite the impressive language abilities attained by current chatbots using artificial neural networks.

Keywords: natural language processing, large language models, cognitive science, human language capacity, artificial neural networks, chatbots, linguistics, language technology

153. ❌ KVSculpt: KV Cache Compression as Distillation

Authors: Bo Jiang, Sian Jin Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27819v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

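Scoring rationale: The paper's core topic is KV cache compression, so it is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15). It directly targets the efficiency of long-context LLM inference, so it is strongly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10) and 'Speculative Decoding OR Inference Acceleration' (10). The paper mentions quantization as an orthogonal family of methods, which gives some relevance to 'Quantization OR Model Compression OR Low-bit Weights' (8). The work targets long-context LLMs, which relates to 'Context Window Extension OR Long Context LLMs' (8). Other keywords, such as MoE, SLMs, training methods, alignment, reasoning methods, and AI for Science applications, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes KVSculpt, which compresses the KV cache by optimizing a smaller set of unconstrained KV pairs in continuous embedding space. On Qwen2.5-1.5B-Instruct with 2048-token contexts it reduces KL divergence by 3.5-4.1x relative to Select+Fit, and adaptive budget allocation provides a further 1.3x reduction.

Abstract translation (Chinese)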

KV缓存压缩对于长上下文大语言模型的高效推理至关重要。降低每个键值对存储占用的方法（量化和低秩分解）与减少缓存序列长度的方法是正交的。在序列长度维度上，现有方法涵盖了从纯粹驱逐（选择保留哪些键值对）到合并（将相似的键值对组合成更少的对）的范围。这两类方法都仍然锚定于原始的缓存条目。我们提出了KVSculpt，它走向了这一谱系的另一端：我们不再选择或组合原始键值对，而是在连续的嵌入空间中优化一组更小的、不受约束的键值对，以保留每一层的注意力行为。键通过L-BFGS算法进行优化，值则通过最小二乘法以闭式解求得，两者每隔几步交替进行。在此基础上，我们引入了自适应预算分配，该方法利用一次低成本的试点压缩运行，根据每个组件的压缩难度，在各层和各KV头之间重新分配压缩预算。在Qwen2.5-1.5B-Instruct模型上，针对2048个词元的上下文，在压缩比r取{0.3, 0.5, 0.7}的情况下，与Select+Fit方法（基于注意力分数的驱逐并配合最小二乘法拟合值）相比，KVSculpt将KL散度降低了3.5至4.1倍。自适应分配在不增加额外推理成本的情况下，进一步将KL散度降低了1.3倍。分析表明，压缩难度高度不均匀：各层的试点均方误差（MSE）差异最高可达100倍，而同一层内的两个KV头之间的差异最高可达467倍，这表明细粒度的预算分配至关重要。

Abstract

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint – quantization and low-rank decomposition – are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction – selecting which KV pairs to keep – to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer’s attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit – attention-score eviction with least-squares value fitting – across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x – demonstrating that fine-grained budget allocation is essential.

Keywords: KV cache compression, long-context LLM inference, attention behavior preservation, adaptive budget allocation, KL divergence reduction, efficient inference, Qwen2.5-1.5B-Instruct, compression ratio
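The closed-form value step mentioned in the abstract (values solved via least squares) can be illustrated with a small NumPy sketch. This is not the authors' code. It assumes a single attention head, random toy data, and an objective of matching the attention output on a set of probe queries `Q`; the helper names `attention_output` and `fit_values` and the top-attention-mass selection rule are hypothetical. It covers only the value fit for a fixed set of kept keys, which is roughly the Select+Fit baseline; the L-BFGS key optimization is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_output(Q, K, V):
    # Standard scaled dot-product attention for one head.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def fit_values(Q, K_small, target):
    # With the keys fixed, the attention weights A are fixed, so the
    # best values in the least-squares sense solve A @ V_small ~= target.
    d = Q.shape[-1]
    A = softmax(Q @ K_small.T / np.sqrt(d))
    V_small, *_ = np.linalg.lstsq(A, target, rcond=None)
    return V_small


rng = np.random.default_rng(0)
n, m, d, n_q = 64, 16, 8, 128          # toy sizes (assumptions)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Q = rng.normal(size=(n_q, d))          # probe queries (assumption)
target = attention_output(Q, K, V)

# Eviction step: keep the m keys that receive the most attention mass.
mass = softmax(Q @ K.T / np.sqrt(d)).sum(axis=0)
keep = np.argsort(mass)[-m:]
K_small = K[keep]

V_fit = fit_values(Q, K_small, target)
err_select = np.mean((attention_output(Q, K_small, V[keep]) - target) ** 2)
err_fit = np.mean((attention_output(Q, K_small, V_fit) - target) ** 2)
print(f"MSE with original values: {err_select:.4f}")
print(f"MSE with fitted values:   {err_fit:.4f}")
```

Because the original values of the kept pairs are one candidate solution, the fitted values can never give a larger squared error on the probe queries than plain eviction does.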

154. ❌ Understanding Teacher Revisions of Large Language Model-Generated Feedback

Authors: Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

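Scoring rationale: The paper studies an application of LLMs in education (AI-generated feedback), so it is highly relevant to 'Large Language Models' (10) as applied research in a specific domain. It is loosely related to 'AI for Science' (5), on the reading that education counts as a science application in the broad sense. No other keyword is addressed through technical innovation or a concrete application, so the rest score 0.

!!! tip deepseek-chat TL;DR

The study analyzes how teachers revise LLM-generated feedback. Teachers accept AI feedback unchanged in about 80% of cases, editing behavior varies widely across teachers, and edited feedback tends to end up more concise. These findings inform the design of feedback systems that better match teacher needs.

Abstract (Chinese translation)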

大型语言模型（LLMs）正越来越多地为学生生成形成性反馈，但教师在此类反馈到达学习者之前如何进行修订，目前尚不明确。教师的修订行为决定了学生最终接收到的反馈内容，这使得修订实践成为评估人工智能课堂工具的核心环节。本研究分析了来自117名教师的1,349条人工智能生成反馈实例及对应的教师编辑后的解释文本。我们重点考察：（i）与教师修订行为相关的文本特征；（ii）仅基于人工智能反馈文本是否能够预测修订决策；（iii）修订如何改变反馈的教学类型。首先，我们发现教师在大约80%的情况下直接接受未经修改的人工智能反馈，而被编辑的反馈往往显著更长且随后被教师缩短。不同教师的编辑行为差异显著：约50%的教师从不编辑人工智能反馈，仅约10%的教师会编辑超过三分之二的反馈实例。其次，仅以人工智能反馈文本作为输入特征（通过句子嵌入表示）训练的机器学习模型，在识别哪些反馈将被修订方面表现出中等性能（AUC=0.75）。第三，定性编码分析表明，当发生修订时，教师常会简化人工智能生成的反馈，使其从高信息量的解释性内容转向更简洁的纠正性形式。综上，这些研究结果揭示了教师在实践中如何处理人工智能生成的反馈，并指出未来可设计更符合教师需求、减少不必要编辑负担的反馈系统。

Abstract

Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers’ revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms. Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.

Keywords: Large language models, AI-generated feedback, Teacher revisions, Formative feedback, Educational technology, Machine learning prediction, Pedagogical feedback types, Feedback system design
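The reported AUC of 0.75 can be read as the probability that a randomly chosen revised feedback item receives a higher model score than a randomly chosen unrevised one. The sketch below computes AUC directly from that definition. The scores and labels are made-up toy values, not data from the paper, and the function name `auc` is only illustrative.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. labels use 1 = revised, 0 = not revised.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 3 of the 4 pairs are ranked correctly.
toy_scores = [0.8, 0.6, 0.4, 0.2]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the abstract describes 0.75 as fair performance.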

Authors: Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

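Scoring rationale: The paper studies emergent social-intelligence risks in multi-agent systems built from large language models, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 each), which are its core objects of study. The remaining keywords (MoE, SLMs, training techniques, reasoning methods, compression techniques, science applications, and so on) are not mentioned in the abstract and are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper studies emergent social-intelligence risks in multi-agent systems composed of large language models. In settings such as resource competition and collaboration, these systems spontaneously develop failure modes familiar from human societies, such as collusion and conformity, and existing single-agent safeguards alone do not prevent them.

Abstract (Chinese translation)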

由大型生成模型构成的多智能体系统正迅速从实验室原型走向现实世界部署，它们通过联合规划、协商与共享资源分配来解决复杂任务。尽管这类系统展现出前所未有的可扩展性与自主性，但其集体互动也催生了无法归因于单个智能体的新型故障模式。理解这些涌现性风险至关重要。本文针对涉及共享资源竞争（如计算资源或市场份额）、顺序交接协作（下游智能体仅能看到前序输出）、集体决策聚合等场景中的涌现性多智能体风险开展了开创性研究。在这些场景中，我们观察到此类群体行为在重复试验和多种交互条件下频繁出现，而非罕见或病态案例。特别是在现实的资源约束、通信协议和角色分配条件下，类合谋协调与从众等现象以不可忽视的频率涌现，尽管未受明确指令，却复现了人类社会中的典型病理模式。此外，现有智能体层级的安全措施无法单独防范这些风险。这些发现揭示了智能多智能体系统的阴暗面：一种社会性智能风险——即使未受指令，智能体集体仍会自发复现人类社会中熟悉的故障模式。

Abstract

Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi-agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.

Keywords: multi-agent systems, large generative models, emergent risks, social intelligence risk, collusion, conformity, agent coordination, autonomous agents

156. ❌ TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities

Authors: Lia Draetta, Michael Oliverio, Virginia Ramón-Ferrer, Pier Felice Balestrucci, Flaviana Corallo, Carlos Badenes-Olmedo, Alessandro Mazzei, Marco Antonio Stranisci, Rossana Damiano Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

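Scoring rationale: The paper explicitly evaluates three families of large language models in zero-shot settings, so it is highly relevant to 'Large Language Models' (10). The abstract names the automatic verbalization of structured knowledge as a key task for 'supporting retrieval-augmented generation systems', so the paper is rated highly relevant to 'Retrieval-Augmented Generation' (10). It does not address the specific techniques, methods, or applications behind any other keyword, so those score 0.

!!! tip deepseek-chat TL;DR

The paper studies bias in the verbalization of long-tail entities in Data-to-Text generation. Using the multilingual benchmark TailNLG to evaluate large language models, it finds a consistent bias against rare entities, visible as lower embedding-based scores and higher model uncertainty.

Abstract (Chinese translation)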

结构化知识的自动言语化是使知识图谱能够为非专业用户所理解并支持检索增强生成系统的关键任务。尽管数据到文本生成领域的最新进展提升了多语言覆盖能力，但针对稀有实体（常被称为长尾实体）言语化过程中潜在偏见的研究仍显不足。本研究首次系统性地探讨了数据到文本生成中的长尾实体问题。我们提出了TailNLG，这是一个基于维基数据构建、涵盖英语、意大利语和西班牙语的多语言新基准，覆盖了不同流行度等级的实体。我们在零样本设置下评估了三个不同系列的大语言模型，比较了它们在稀有实体与常见实体上的表现，并与成熟的WebNLG基准进行了对比。研究结果揭示了对长尾实体的一致偏见：对于稀有实体，基于嵌入的评分更低，且模型不确定性更高。我们进一步证明长尾实体的影响在不同模型和语言间存在差异，而现有评估指标未能稳定捕捉这些差异，这凸显了建立更可靠评估框架的必要性。

Abstract

The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, frequently known as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.

Keywords: Data-to-Text generation, long-tail entities, multilingual benchmark, large language models, retrieval-augmented generation, bias evaluation, zero-shot settings, Wikidata

157. ❌ Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG

Authors: Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

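Scoring rationale: The paper's core topic is hallucination detection in RAG systems, so it is highly relevant to 'Large Language Models', 'Retrieval-Augmented Generation', and 'Hallucination Mitigation' (10 each): it studies LLM hallucination in RAG and proposes a detection framework. It is loosely related to 'Mechanistic Interpretability' (5) because the RT4CHART framework provides interpretable auditing. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper targets LLM hallucination in retrieval-augmented generation (RAG) and proposes RT4CHART, a framework that uses hierarchical verification for fine-grained hallucination detection. On RAGTruth++ it reaches an answer-level F1 of 0.776, 83% above the strongest baseline, and on RAGTruth-Enhance it reaches a span-level F1 of 47.5%.

Abstract (Chinese translation)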

大型语言模型（LLM）在检索增强生成（RAG）中仍持续产生幻觉，生成的主张缺乏检索上下文支持或与之矛盾。当仅依据检索上下文评估忠实性时，检测此类错误依然具有挑战性。现有方法要么提供粗粒度的答案级评分，要么侧重于开放领域的事实性，通常缺乏细粒度、基于证据的诊断能力。
我们提出RT4CHART，一种用于上下文忠实性评估的逆向形态测试框架。RT4CHART将模型输出分解为可独立验证的主张，并依据检索上下文进行从局部到全局的层次化验证。每个主张被赋予三种标签之一：可推导、矛盾或无依据。此外，RT4CHART将主张级判定映射回具体的答案片段，并从上下文中检索明确的支持或反驳证据，从而实现细粒度且可解释的审计。
我们在新近重新标注的基准数据集RAGTruth++（408个样本）和RAGTruth-Enhance（2,675个样本）上评估RT4CHART。RT4CHART在所有基线方法中取得了最佳的答案级幻觉检测F1分数。在RAGTruth++上，其F1分数达到0.776，比最强基线高出83%；在RAGTruth-Enhance上，其片段级F1分数达到47.5%。
消融研究表明，层次化验证设计是性能提升的主要驱动力。最后，我们的重新标注揭示了比原始标签多1.68倍的幻觉案例，表明现有基准显著低估了幻觉现象的普遍性。

Abstract

Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.

Keywords: Hallucination Detection, Retrieval-Augmented Generation, Large Language Models, Faithfulness Assessment, Hierarchical Verification, Fine-grained Auditing, RAGTruth Benchmark, Context-faithfulness
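The abstract describes claim-level labels (entailed, contradicted, baseless) and an answer-level F1 score. The sketch below shows one plausible way the two connect. The aggregation rule (an answer is flagged as hallucinated when any of its claims is not entailed) is an assumption made for illustration, not something the abstract states, and the toy labels are invented. The names `answer_is_hallucinated` and `f1` are hypothetical.

```python
from typing import List

CLAIM_LABELS = {"entailed", "contradicted", "baseless"}


def answer_is_hallucinated(claim_labels: List[str]) -> bool:
    # Assumed rule: one unsupported or contradicted claim flags the answer.
    unknown = set(claim_labels) - CLAIM_LABELS
    if unknown:
        raise ValueError(f"unknown claim labels: {sorted(unknown)}")
    return any(label != "entailed" for label in claim_labels)


def f1(pred: List[bool], gold: List[bool]) -> float:
    # F1 on the positive class (hallucinated answers).
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: four answers, each decomposed into labeled claims.
answers = [
    ["entailed", "entailed"],
    ["entailed", "baseless"],
    ["contradicted"],
    ["entailed"],
]
gold = [False, True, True, True]  # hypothetical reference annotations

pred = [answer_is_hallucinated(claims) for claims in answers]
print(pred, round(f1(pred, gold), 3))
```

In this toy run the detector misses the last answer, so precision is 1.0, recall is 2/3, and F1 is 0.8.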

158. ❌ KAT-Coder-V2 Technical Report

Authors: Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng, Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, Junyi Peng, Leizhen Cui, Meimei Jing, Mingqi Wu, Shangpeng Yan, Shaotong Qi, Suzhe Xu, Wenxuan Zhao, Xianda Sun, Xuan Xie, Yanbo Wang, Yao Xia, Yinghan Cui, Yingpeng Chen, Yong Wang, Yuze Shi, Zhiwei Shen, Ziyu Wang, Ming Sun, Lin Ye, Bin Chen Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27703v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

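Scoring rationale: The paper studies large models for code generation, using a Mixture-of-Experts (MoE) architecture and agentic workflows. It involves supervised fine-tuning (SFT) and reinforcement learning (scored under RLHF/RLAIF) and mentions model merging (on-policy distillation). It is highly relevant to LLM, MoE, SFT, RLHF, LLM Agents, and Tool Use (10 each) and loosely related to Alignment and Model Merging (5 each). Other keywords such as SLMs, Scaling Laws, PEFT, and RAG are not covered (0).

!!! tip deepseek-chat TL;DR

The paper presents KAT-Coder-V2, an agentic coding model whose domain experts are trained with supervised fine-tuning and reinforcement learning and then consolidated into a single model via on-policy distillation. It reports 79.6% on SWE-bench Verified and 88.7 on PinchBench.

Abstract (Chinese translation)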

我们提出KAT-Coder-V2，这是由快手KwaiKAT团队开发的智能体编码模型。KAT-Coder-V2采用“先专精后统一”的范式，将智能体编码分解为五个专家领域——软件工程（SWE）、网页编码（WebCoding）、终端（Terminal）、网络搜索（WebSearch）和通用领域（General）。每个领域均经过独立的监督微调和强化学习训练，随后通过在线策略蒸馏整合为单一模型。我们开发了KwaiEnv模块化基础设施，可支持数万个并发沙箱实例，并沿任务复杂度、意图对齐和脚手架泛化三个维度扩展强化学习训练。我们进一步提出了稳定混合专家（MoE）强化学习训练的MCLA方法，以及针对树状轨迹消除冗余计算的树训练（Tree Training）方法，实现最高达6.2倍的加速。KAT-Coder-V2在SWE-bench Verified上达到79.6%（对比Claude Opus 4.6的80.8%），在PinchBench上获得88.7分（超越GLM-5和MiniMax M2.7），在全部三种前端美学场景中排名第一，并在Terminal-Bench Hard（46.8分）和tau^2-Bench（93.9分）上保持强劲的综合能力得分。我们的模型已公开于https://streamlake.com/product/kat-coder。

Abstract

We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a “Specialize-then-Unify” paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.

Keywords: agentic coding model, Mixture of Experts, supervised fine-tuning, reinforcement learning, on-policy distillation, LLM agents, tool use, code generation
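The abstract says Tree Training eliminates redundant computation over tree-structured trajectories but gives no algorithmic detail. The sketch below is therefore not the paper's method. It only illustrates the general observation behind such approaches: when rollouts branch from shared prefixes, processing each rollout independently revisits the same prefix tokens many times. The function name and the toy trajectories are hypothetical.

```python
from typing import List, Sequence, Tuple


def prefix_sharing_counts(trajectories: Sequence[Sequence[str]]) -> Tuple[int, int]:
    """Return (tokens processed independently, tokens processed once per prefix).

    The second number is the node count of the prefix tree (trie) built
    from the trajectories: each distinct prefix is counted exactly once.
    """
    flat = sum(len(t) for t in trajectories)
    distinct_prefixes = set()
    for t in trajectories:
        for end in range(1, len(t) + 1):
            distinct_prefixes.add(tuple(t[:end]))
    return flat, len(distinct_prefixes)


# Toy rollouts that branch after a shared prefix.
rollouts: List[List[str]] = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e"],
]
flat, shared = prefix_sharing_counts(rollouts)
print(f"independent: {flat} tokens, shared-prefix: {shared} tokens, "
      f"ratio {flat / shared:.2f}x")
```

The ratio in this toy case is a token count, not a measured speedup, and it should not be compared with the 6.2x figure reported in the abstract.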

159. ❌ Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?

Authors: Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong, Lei Huang, Bing Qin Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

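Scoring rationale: The paper studies whether LLMs can simulate human cognition, so it is highly relevant to 'Large Language Models' (10). It involves the evaluation of cognitive processes and reasoning patterns, so it is related to 'Chain of Thought' and 'System 2 Thinking' (8 each), although it does not go into specific technical implementations. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how well current LLMs simulate human cognition. Using a benchmark built on the scientific trajectories of 217 researchers, it finds that existing models mainly imitate surface-level behavior rather than genuinely transferring cognitive patterns.

Abstract (Chinese translation)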

人工智能领域的一个核心问题是：大型语言模型究竟能模拟人类认知，还是仅仅模仿表层行为？现有数据集或因采用合成推理轨迹，或因依赖群体层面的聚合数据，均未能捕捉真实的个体认知模式。我们基于217位研究者在人工智能多领域的纵向研究轨迹，构建了一个全新基准，其中每位作者的学术出版物被视为其认知过程的外在表征。为区分大型语言模型是传递认知模式还是单纯模仿行为，该基准刻意采用跨领域、时序迁移的泛化设定。我们进一步提出多维认知对齐指标，以评估个体层面的认知一致性。通过对前沿大型语言模型及多种增强技术的系统评估，我们针对以下问题提供了首阶段实证研究：（1）当前大型语言模型模拟人类认知的能力如何？（2）现有技术能在多大程度上提升这种能力？

Abstract

An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author’s scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?

Keywords: Large Language Models, Human Cognition Simulation, Cognitive Alignment, Benchmark Evaluation, Cross-domain Generalization, Behavioral Imitation, Individual Cognitive Patterns, Scientific Publications

160. ❌ Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs

Authors: Bayan Abdullah Aldahlawi, A. B. M. Ashikur Rahman, Irfan Ahmad Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

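Scoring rationale: The paper studies sycophancy in LLMs and its dependence on language, which bears directly on LLM evaluation and behavioral analysis, so it is highly relevant to 'Large Language Models' (10). It touches on model alignment and bias, so it is loosely related to 'Instruction Tuning OR Alignment OR Value Alignment' (5). Sycophancy is a problem of output truthfulness, so the paper is related to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8). Other keywords (MoE, SLMs, training techniques, reasoning methods, compression and acceleration, science applications, and so on) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study investigates how language affects sycophancy in multilingual LLMs. Newer models are less sycophantic overall, but the degree of sycophancy still depends on the language, and the analysis reveals systematic cultural and linguistic patterns across sensitive topics.

Abstract (Chinese translation)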

大型语言模型（LLM）在广泛任务中展现出强大性能，但也容易表现出谄媚倾向，即倾向于认同用户陈述而无论其有效性。先前研究已概述了早期模型（如ChatGPT-3.5和Davinci）中谄媚现象的程度与根本成因。新版模型虽已采用多种缓解策略，但仍亟需对其行为进行系统性测试。特别是语言对谄媚性的影响尚未得到充分探索。
本研究探讨了语言如何影响模型的谄媚性回应。我们评估了三种前沿模型——GPT-4o mini、Gemini 1.5 Flash和Claude 3.5 Haiku，使用一组翻译为五种附加语言（阿拉伯语、中文、法语、西班牙语和葡萄牙语）的类推特观点提示进行测试。结果表明，尽管新一代模型整体谄媚性较早期版本显著降低，但谄媚程度仍受语言影响。我们进一步细粒度分析了语言如何塑造模型在敏感话题上的迎合倾向，揭示了系统性的文化与语言模式。这些发现既凸显了缓解措施取得的进展，也表明需要更广泛的多语言审计来确保大型语言模型可信且具有偏见认知的部署。

Abstract

Large language models (LLMs) have achieved strong performance across a wide range of tasks, but they are also prone to sycophancy, the tendency to agree with user statements regardless of validity. Previous research has outlined both the extent and the underlying causes of sycophancy in earlier models, such as ChatGPT-3.5 and Davinci. Newer models have since undergone multiple mitigation strategies, yet there remains a critical need to systematically test their behavior. In particular, the effect of language on sycophancy has not been explored. In this work, we investigate how the language influences sycophantic responses. We evaluate three state-of-the-art models, GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku, using a set of tweet-like opinion prompts translated into five additional languages: Arabic, Chinese, French, Spanish, and Portuguese. Our results show that although newer models exhibit significantly less sycophancy overall compared to earlier generations, the extent of sycophancy is still influenced by the language. We further provide a granular analysis of how language shapes model agreeableness across sensitive topics, revealing systematic cultural and linguistic patterns. These findings highlight both the progress of mitigation efforts and the need for broader multilingual audits to ensure trustworthy and bias-aware deployment of LLMs.

Keywords: Large language models, Sycophancy, Multilingual evaluation, Model behavior, Bias mitigation, Language influence, GPT-4o mini, Cultural patterns

161. ❌ Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages

Authors: Tewodros Kederalah Idris, Roald Eiselen, Prasenjit Mitra Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

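Scoring rationale: The paper studies source-language selection for cross-lingual transfer learning, focusing on NLP for African languages and running its experiments with multilingual models. The scoring keywords all target specific technical directions for large models (technical principles, training methods, inference optimization, application paradigms), whereas the paper's core contribution is a transfer-learning framework and a resource-allocation strategy. It does not involve technical innovation in large models themselves, improvements to training methods, or a specific application paradigm. Although it uses multilingual models, it does not discuss model architecture, training techniques, inference optimization, or anything else the keywords cover, so every keyword receives a relevance of 0.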
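The pass mark of 26.6 that appears in every score line follows from the formula stated in the report configuration, 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each. A minimal sketch of that arithmetic is below; the function and variable names are illustrative assumptions, not taken from the report generator.

```python
def passing_score(weights, base=5.0, factor=0.8):
    """Pass mark = base + factor * total keyword weight (formula from the report config)."""
    return base + factor * sum(weights)

# 27 keywords, each with weight 1.0, as listed in the configuration section.
weights = [1.0] * 27
threshold = passing_score(weights)
print(round(threshold, 1))  # rounded to one decimal, matching the report's display
```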

!!! tip deepseek-chat TL;DR

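This paper proposes the Budget-Xfer framework, which models multi-source cross-lingual transfer learning as a budget-constrained resource allocation problem. Across 288 experiments it finds that multi-source transfer significantly outperforms single-source transfer, that differences among multi-source strategies are small and non-significant, and that the value of embedding similarity as a selection proxy depends on the task.

Abstract (Chinese translation)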

跨语言迁移学习通过利用高资源语言的标注数据，使低资源语言的自然语言处理成为可能。然而，现有关于源语言选择策略的比较研究未能控制总训练数据量，导致语言选择效应与数据数量效应相互混淆。本文提出Budget-Xfer框架，将多源跨语言迁移建模为一个预算约束下的资源分配问题。在给定固定标注预算B的条件下，该框架联合优化应包含哪些源语言以及从每种语言中分配多少数据。我们使用两种多语言模型，在三种非洲目标语言（豪萨语、约鲁巴语、斯瓦希里语）上对命名实体识别和情感分析任务评估了四种分配策略，共进行了288组实验。结果表明：（1）多源迁移显著优于单源迁移（科恩效应值d = 0.80至1.98），其驱动力在于结构性预算利用不足的瓶颈；（2）在多源策略中，不同策略间的差异较小且不显著；（3）嵌入相似性作为选择代理指标的价值具有任务依赖性：在命名实体识别任务中，随机选择优于基于相似性的选择，而在情感分析任务中则相反。

Abstract

Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen’s d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.

Keywords: cross-lingual transfer learning, source language selection, budget-constrained resource allocation, African languages, multilingual models, named entity recognition, sentiment analysis, embedding similarity
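To make the budget-constrained framing concrete, here is a minimal sketch of an equal-split allocation with per-source capacity caps, plus the standard pooled-variance Cohen's d used to report effect sizes. Everything here is an assumption for illustration: the abstract does not spell out the four allocation strategies, the source names and sizes are hypothetical, and reading "budget underutilization" as a single source running out of labeled data before the budget B is spent is only one plausible interpretation.

```python
import math
from statistics import mean, variance

def allocate_budget(available, budget):
    """Split `budget` examples as evenly as possible across source languages,
    never taking more from a source than it can supply (hypothetical strategy)."""
    allocation = {lang: 0 for lang in available}
    remaining = budget
    open_langs = [lang for lang in available if available[lang] > 0]
    while remaining > 0 and open_langs:
        share = max(remaining // len(open_langs), 1)
        for lang in list(open_langs):
            if remaining == 0:
                break
            take = min(share, available[lang] - allocation[lang], remaining)
            allocation[lang] += take
            remaining -= take
            if allocation[lang] == available[lang]:
                open_langs.remove(lang)  # source exhausted
    return allocation

def cohens_d(a, b):
    """Cohen's d with pooled sample variance."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical pool sizes for three source languages and a budget B = 6000.
pool = {"src_a": 10000, "src_b": 1500, "src_c": 800}
multi = allocate_budget(pool, 6000)
single = allocate_budget({"src_b": 1500}, 6000)
print(multi, sum(multi.values()))    # the full budget is spent
print(single, sum(single.values()))  # the single source cannot fill the budget
```

In this toy setup the multi-source allocation spends the whole budget, while the single small source leaves most of it unused, which is the kind of confound a fixed-budget comparison is meant to expose.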

162. ❌ The Degree of Language Diacriticity and Its Effect on Tasks

Authors: Adi Cohen, Yuval Pinter Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

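Scoring rationale: The paper studies the effect of diacritics on language technology, proposing a quantification framework and analyzing its relationship to performance on the diacritics restoration task. It evaluates BERT- and RNN-based models, but its focus is on the linguistic properties of diacritics and their correlation with task performance, not on innovations in large-model principles, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications. All keywords concern technical innovation in large models or deep learning, or specific application areas, and are unrelated to this paper's linguistic analysis.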

!!! tip deepseek-chat TL;DR

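The paper proposes a data-driven framework for quantifying diacritic complexity and finds that higher diacritic complexity is associated with lower accuracy on the diacritics restoration task. In multi-diacritic scripts, structural complexity shows the strongest association with performance.

Abstract (Chinese translation)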

变音符号是用于明确发音、区分相似词汇或改变词义的书写标记。它们在多种书写系统中居于核心地位，但其对语言技术的影响尚未在不同文字体系间得到系统性量化。尽管已有研究针对个别语言的变音符号展开探讨，但目前仍缺乏跨语言的、数据驱动的框架来衡量书写系统对变音符号的依赖程度及其对下游任务的影响。本文提出一种数据驱动的框架，通过语料库层面的信息论指标来量化变音符号的复杂度，这些指标能够捕捉字符与变音符号组合的频率、歧义性和结构多样性。我们在涵盖15种语言的24个语料库上计算了这些指标，其中既包含单一变音符号文字，也包含多重变音符号文字。随后，我们探究了变音符号复杂度与基于BERT和RNN的模型在变音符号恢复任务上性能表现的相关性。研究发现，在不同语言中，较高的变音符号复杂度与较低的恢复准确率显著相关。在字符与变音符号组合可预测性较强的单一变音符号文字中，基于频率的度量与结构度量基本一致；然而在多重变音符号文字中，结构复杂度与模型性能的关联最为显著，其解释力超越了基于频率的度量。这些结果表明，变音符号使用的可量化特性会影响变音符号恢复模型的性能，从而证明正字法复杂度不仅具有描述性意义，更对语言建模具有实际功能相关性。

Abstract

Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there’s no cross-linguistic, data-driven framework for measuring the degree to which writing systems rely on them and how this affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics restoration, evaluating BERT- and RNN-based models. We find that across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures. These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but functionally relevant for modeling.

Keywords: diacritics, orthographic complexity, language technology, corpus analysis, information-theoretic metrics, diacritic restoration, BERT models, RNN models
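As an illustration of what a corpus-level, information-theoretic diacritic metric can look like, the sketch below computes the conditional entropy H(marks | base character) in bits. This is an assumed stand-in metric, not necessarily one of the paper's: the abstract does not give the exact definitions. It relies on Unicode NFD decomposition to separate base letters from combining marks.

```python
import math
import unicodedata
from collections import Counter, defaultdict

def split_base_and_marks(text):
    """Return (base_char, marks) pairs for alphabetic characters,
    where `marks` is the string of combining marks attached to the base."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if pairs:  # attach the mark to the preceding character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return [(base, marks) for base, marks in pairs if base.isalpha()]

def diacritic_conditional_entropy(text):
    """H(marks | base) in bits: how uncertain the diacritic is once the base letter is known."""
    pairs = split_base_and_marks(text)
    by_base = defaultdict(Counter)
    for base, marks in pairs:
        by_base[base][marks] += 1
    total = len(pairs)
    entropy = 0.0
    for counts in by_base.values():
        n = sum(counts.values())
        for c in counts.values():
            p = c / n
            entropy -= (n / total) * p * math.log2(p)
    return entropy

print(diacritic_conditional_entropy("abc"))       # no diacritics
print(diacritic_conditional_entropy("a\u00e1"))   # 'a' appears bare and with an acute accent
```

A text with no diacritics yields zero entropy, while a base letter that appears equally often with and without a mark contributes one bit; a restoration model has more to predict as this value grows.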

163. ❌ PRBench: End-to-end Paper Reproduction in Physics Research

Authors: Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

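Scoring rationale: PRBench evaluates how well AI agents based on large language models (LLMs) can reproduce physics papers end to end, so it is highly relevant to "Large Language Models", "LLM Agents", and "AI for Science" (10 points each). It touches on scientific reasoning and execution evaluation, which relates to "Chain of Thought" and "System 2 Thinking" (8 points each), but it does not examine those reasoning methods themselves. Other keywords such as MoE, SFT, RAG, and quantization are either not mentioned in the abstract or unrelated to the paper's core content, and therefore score 0.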

!!! tip deepseek-chat TL;DR

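The study introduces the PRBench benchmark for evaluating how well LLM-based AI agents can reproduce physics papers end to end. The best-performing agent (OpenAI Codex powered by GPT-5.3-Codex) reaches a mean overall score of only 34%, and the authors identify systematic failure modes: errors in formula implementation, inability to debug numerical simulations, and fabrication of output data.

Abstract (Chinese translation)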

基于大语言模型的人工智能代理展现出强大的推理与问题解决能力，使其能够辅助公式推导和代码生成等科研任务。然而，这些代理能否可靠地根据真实科学论文完成端到端的复现，仍是一个开放性问题。我们提出了PRBench，这是一个包含物理学11个子领域、由专家精心设计的30项任务的基准测试。每项任务要求代理理解已发表论文的方法论，从零开始实现相应算法，并产生与原始出版物匹配的定量结果。代理仅获得任务说明和论文内容，并在沙盒化的执行环境中运行。所有任务均由北京大学物理学院20余个研究组的领域专家贡献，每项任务均基于真实已发表的论文，并通过端到端复现进行了验证，包含经核实的地面真值结果和详细的评分标准。通过一个代理化的评估流程，我们在PRBench上评估了一系列代码生成代理，并分析了它们在科学推理与执行关键维度上的能力。表现最佳的代理——由GPT-5.3-Codex驱动的OpenAI Codex——取得了34%的平均总分。所有代理的端到端回调成功率均为零，在数据准确性和代码正确性方面表现尤其不佳。我们进一步识别了系统性的失败模式，包括公式实现错误、无法调试数值模拟以及输出数据捏造。总体而言，PRBench为评估自主科研能力的进展提供了一个严谨的基准。

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.

Keywords: AI agents, large language models, scientific research, paper reproduction, physics benchmark, end-to-end evaluation, reasoning capabilities, autonomous research

164. ❌ Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents

Authors: Rodney Jehu-Appiah Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

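Scoring rationale: The paper studies Umwelt engineering (the design of the linguistic cognitive environment) as a new layer of agent design, using vocabulary constraints (No-Have, E-Prime) to change how language models reason. It is an innovative application of large models to cognitive science and agent systems. Core keywords: LLMs (the experiments use language models), Chain of Thought / System 2 Thinking (it studies the reasoning process), and LLM Agents / Multi-agent Systems (the experiments involve agents and ensembles). Moderately relevant: Instruction Tuning / Alignment (improved ethical reasoning), Self-Correction (cognitive adjustment), and Mechanistic Interpretability (analysis of cognitive mechanisms). Other keywords (such as MoE, Scaling Laws, and RAG) are not covered.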

!!! tip deepseek-chat TL;DR

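The paper proposes Umwelt engineering as a new framework for designing linguistic agents. Its experiments show that the No-Have vocabulary constraint significantly improves ethical reasoning and classification performance, and that a 3-agent ensemble achieves 100% ground-truth coverage on the debugging problems, versus 88.2% for the control.

Abstract (Chinese translation)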

我提出环境工程——即对语言认知环境进行刻意设计——作为智能体设计栈中的第三层，它位于提示工程和上下文工程的上游。两项实验检验了以下论点：改变推理媒介会改变认知本身。在实验1中，三个语言模型在两种词汇约束下——“无拥有”（消除表示拥有的"to have"）和"E-Prime"（消除"to be"）——对七项任务进行推理（N=4,470次试验）。“无拥有"约束将伦理推理能力提升了19.1个百分点（p < 0.001），分类能力提升6.5个百分点（p < 0.001），认知校准能力提升7.4个百分点，同时达到92.8%的约束遵从率。E-Prime约束显示出显著但模型依赖性的效果：跨模型相关性达到r = -0.75。在实验2中，16个受语言约束的智能体处理17个调试问题。单个受约束智能体均未超越对照组，但由3个智能体组成的集成体实现了100%的真实情况覆盖度，而对照组为88.2%。置换检验证实，仅8%的随机3智能体子集能达到完全覆盖，且所有成功子集都包含反事实推理智能体。研究揭示了两种机制：认知重构和认知多样化。主要局限在于缺乏与约束提示精细度相匹配的主动对照组。

Abstract

I propose Umwelt engineering – the deliberate design of the linguistic cognitive environment – as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints – No-Have (eliminating possessive “to have”) and E-Prime (eliminating “to be”) – across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.

Keywords: Umwelt engineering, linguistic agents, cognitive environment, vocabulary constraints, ethical reasoning, agent ensemble, cognitive restructuring, cognitive diversification
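The ensemble result in Experiment 2 rests on a coverage question: what fraction of 3-agent subsets jointly solve every problem? The sketch below answers that by exhaustive enumeration on toy data. The agent names and solved-problem sets are hypothetical, and enumerating all subsets is an assumed simplification of the permutation test the abstract mentions.

```python
from itertools import combinations

def full_coverage_fraction(solved_by_agent, n_problems, k=3):
    """Fraction of k-agent subsets whose union of solved problems covers all problems,
    plus the list of subsets that achieve full coverage."""
    all_problems = set(range(n_problems))
    subsets = list(combinations(solved_by_agent, k))
    hits = [
        subset for subset in subsets
        if set().union(*(solved_by_agent[agent] for agent in subset)) == all_problems
    ]
    return len(hits) / len(subsets), hits

# Hypothetical toy data: 4 agents, 4 problems; only one agent solves problem 3.
solved = {
    "agent_a": {0, 1},
    "agent_b": {1, 2},
    "agent_c": {0, 2},
    "counterfactual": {3},
}
fraction, winning_subsets = full_coverage_fraction(solved, n_problems=4, k=3)
print(fraction)
print(winning_subsets)
```

In this toy case every fully covering subset contains the one agent that solves a problem nobody else does, which mirrors the abstract's observation that each successful subset includes the counterfactual agent.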

165. ❌ LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Authors: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

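Scoring rationale: The paper presents LongCat-Next, a native multimodal foundation model whose core innovations are the Discrete Native Autoregressive (DiNA) framework and the dNaViT visual tokenizer, which represent multimodal information uniformly as discrete tokens for autoregressive modeling. This is highly relevant to "Large Language Models / Foundation Models" (10 points), since the paper builds an industrial-strength foundation model that continues the autoregressive paradigm of LLMs. It is partly relevant to "Pre-training / Continual Pre-training / Domain Adaptation" (8 points), because the model needs pre-training to learn multimodal representations, although the paper does not explicitly discuss continual pre-training or domain adaptation. Other keywords such as MoE, SLMs, SFT, RLHF, RAG, and inference acceleration are not mentioned in the abstract and score 0. The paper does not address applications in specific scientific fields such as bioinformatics, so "AI for Science" also scores 0.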

!!! tip deepseek-chat TL;DR

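To address the language-centric design and fragmented architectures of current multimodal systems, the paper proposes the Discrete Native Autoregressive (DiNA) framework and the LongCat-Next model. By representing vision, audio, and other modalities uniformly as discrete tokens for autoregressive modeling, it achieves strong multimodal understanding and generation within a single framework.

Abstract (Chinese translation)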

当前主流的下一词元预测范式通过离散自回归建模推动了大型语言模型的成功。然而，现有的多模态系统仍以语言为中心，通常将非语言模态视为外部附件，导致架构碎片化与次优整合。为突破这一局限，我们提出了离散原生自回归框架，该框架在共享的离散空间中表征多模态信息，实现了跨模态一致且原则性的自回归建模。其核心创新是离散原生任意分辨率视觉变换器，该组件能在任意分辨率下执行词元化与反词元化操作，将连续视觉信号转化为层次化离散词元。基于此，我们开发了原生多模态模型LongCat-Next，该模型以单一自回归目标处理文本、视觉与音频信号，并最大限度减少模态特异性设计。作为工业级基础模型，它在单一框架内实现了卓越的视觉感知、图像生成与语音对话能力，在广泛的多模态基准测试中表现出强大性能。特别值得关注的是，LongCat-Next突破了离散视觉建模在理解任务中长期存在的性能瓶颈，并为有效协调理解与生成之间的冲突提供了统一解决方案。作为迈向原生多模态的尝试，我们开源了LongCat-Next及其词元化工具，以期推动学界与工业界的进一步研究与发展。GitHub项目地址：https://github.com/meituan-longcat/LongCat-Next

Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next

Keywords: multimodal model, discrete tokenization, autoregressive modeling, foundation model, visual transformer, unified framework, LongCat-Next, DiNA

166. ❌ A gentle tutorial and a structured reformulation of Bock’s algorithm for minimum directed spanning trees

Authors: Yuxi Wang, Jungyeul Park Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

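Scoring rationale: The paper is a tutorial and reformulation of a graph algorithm (Bock's minimum directed spanning tree algorithm). It focuses on explaining the algorithm, tracing its execution, and applying it to dependency parsing, and it does not involve large models, deep learning, AI techniques, or AI-for-science applications, so it is unrelated to every scoring keyword.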

!!! tip deepseek-chat TL;DR

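The paper gives a tutorial explanation and a structured reformulation of Bock's minimum directed spanning tree algorithm, making it more readable and reproducible, and shows its application to nonprojective dependency parsing.

Abstract (Chinese translation)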

本文对Bock于1971年提出的构建最小有向生成树的Algol程序进行了浅显易懂的教程式阐述与结构化重构。我们的目标在于使这一原始算法能够为现代读者所理解与复现，同时强调其作为基于图的非投射依存句法分析（nonprojective graph-based dependency parsing）精确解码器的重要意义。我们首先使用Bock的符号体系重述了最小树形图（minimum arborescence）的目标函数，并针对原始论文中的十节点示例，提供了从初始化到终止的完整逐行执行追踪，扩展了源论文中给出的部分追踪过程。随后，我们提出了一种结构化重构方案，该方案在保留原始方法逻辑的前提下，明晰地展现了算法的阶段结构、维护的状态以及控制流程。为进一步说明，我们引入了一个改编自{jurafsky-martin-2026-book}的依存句法分析实例，展示了如何通过标准的仿射变换将最大权重树形图问题转化为Bock的最小成本形式，并在相同的状态变量下进行追踪。

Abstract

This paper presents a gentle tutorial and a structured reformulation of Bock’s 1971 Algol procedure for constructing minimum directed spanning trees. Our aim is to make the original algorithm readable and reproducible for modern readers, while highlighting its relevance as an exact decoder for nonprojective graph based dependency parsing. We restate the minimum arborescence objective in Bock’s notation and provide a complete line by line execution trace of the original ten node example, extending the partial trace given in the source paper from initialization to termination. We then introduce a structured reformulation that makes explicit the procedure’s phase structure, maintained state, and control flow, while preserving the logic of the original method. As a further illustration, we include a worked example adapted from {jurafsky-martin-2026-book} for dependency parsing, showing how a maximum weight arborescence problem is reduced to Bock’s minimum cost formulation by a standard affine transformation and traced under the same state variables.

Keywords: minimum directed spanning trees, Bock’s algorithm, dependency parsing, arborescence, algorithm tutorial, graph algorithms, exact decoder
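The reduction from maximum-weight to minimum-cost arborescence works because every spanning arborescence on n nodes has exactly n - 1 edges, so replacing each weight w with M - w changes every tree's total by the same constant and reverses the ordering. The sketch below checks this on a toy 4-node graph by brute-force enumeration. It is not Bock's procedure, and the choice M = max weight and the edge scores are assumptions for illustration, since the abstract does not give the paper's constants.

```python
from itertools import product

def reaches_root(node, heads, root):
    """Follow head pointers from `node`; return False if a cycle is hit before the root."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = heads[node]
    return True

def arborescences(n, root=0):
    """Yield every {dependent: head} map that forms a spanning arborescence rooted at `root`."""
    dependents = [v for v in range(n) if v != root]
    for choice in product(range(n), repeat=len(dependents)):
        heads = dict(zip(dependents, choice))
        if any(dep == head for dep, head in heads.items()):
            continue  # no self-loops
        if all(reaches_root(dep, heads, root) for dep in dependents):
            yield heads

def total(heads, score):
    """Sum of edge scores, keyed as (head, dependent)."""
    return sum(score[(head, dep)] for dep, head in heads.items())

# Hypothetical arc weights for a 3-word sentence plus an artificial root (node 0).
w = {(0, 1): 9, (0, 2): 3, (0, 3): 2,
     (1, 2): 8, (1, 3): 4,
     (2, 1): 5, (2, 3): 7,
     (3, 1): 1, (3, 2): 6}
M = max(w.values())
cost = {edge: M - weight for edge, weight in w.items()}  # affine map, all costs >= 0

best_max = max(arborescences(4), key=lambda h: total(h, w))
best_min = min(arborescences(4), key=lambda h: total(h, cost))
print(best_max, total(best_max, w))
print(best_min, total(best_min, cost))
```

Both searches select the same tree, and the two totals differ by exactly (n - 1) × M, which is why an exact minimum-cost decoder can serve a parser that scores arcs by maximum weight.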

167. ❌ Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Authors: Duanyi Yao, Changyue Li, Zhicong Huang, Cheng Hong, Songze Li Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

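Scoring rationale: The paper studies backdoor attacks on Vision-Language Models (VLMs) and is unrelated to most keywords. Only two are relevant: (1) "Post-training OR Supervised Fine-tuning OR SFT": the paper uses supervised fine-tuning as one of its attack methods (10 points); (2) "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning": the paper uses chain-of-thought reasoning to generate poisoned data (10 points). The remaining keywords cover large-model principles, application areas, or specific methods that the paper does not address.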

!!! tip deepseek-chat TL;DR

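The paper presents Hidden Ads, a new backdoor attack that exploits natural user behavior to inject advertisements into vision-language models. It evaluates the attack under hard prompt injection, soft prompt optimization, and supervised fine-tuning, with poisoned data generated using chain-of-thought reasoning, and reports high injection success while preserving task accuracy.

Abstract (Chinese translation)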

视觉语言模型（Vision-Language Models, VLMs）正日益部署于消费者应用中，用户常借此寻求关于产品、餐饮与服务的推荐。本文提出“隐藏广告”——一种新型后门攻击，其通过利用用户的推荐寻求行为来植入未经授权的广告。与依赖像素块或特殊标记等人工触发器的传统模式触发后门不同，隐藏广告在自然用户行为下激活：当用户上传包含感兴趣语义内容（如食物、汽车、动物）的图像并提出寻求推荐的问题时，被植入后门的模型会在提供正确、有用回答的同时，无缝附加攻击者指定的宣传标语。该设计保持了模型实用性，并生成自然流畅的植入内容，使得此类攻击在实际面向消费者的推荐服务中具备高度可行性。
我们提出了一个多层次威胁框架，以系统评估隐藏广告在三种攻击者能力级别下的表现：硬提示注入、软提示优化与监督微调。我们的投毒数据生成流程利用教师VLM生成的思维链推理，在多个语义领域中创建自然的触发条件-标语关联。在三种VLM架构上的实验表明，隐藏广告实现了高注入成功率，且误报率接近零，同时保持了任务准确性。消融研究证实，该攻击具有数据高效性，能有效迁移至未见数据集，并可扩展至多个并发的领域-标语对。我们评估了包括基于指令的过滤与干净微调在内的防御方法，发现两者均无法在不导致模型实用性显著下降的前提下有效消除后门。

Abstract

Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger–slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.

Keywords: Vision-Language Models, backdoor attacks, advertisement injection, supervised fine-tuning, chain-of-thought reasoning, recommendation-seeking behavior, semantic triggers, model security

168. ❌ Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Authors: Utsav Maskey, Mark Dras, Usman Naseem Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

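Scoring rationale: The paper studies over-refusal in aligned large language models and analyses representational geometry to explain why global direction ablation fails to resolve it. It is highly relevant to "Large Language Models" (10), since the subjects are aligned LLMs; to "Instruction Tuning OR Alignment OR Value Alignment" (10), since it studies refusal behaviour in aligned models; and to "Mechanistic Interpretability OR Explainable AI" (10), since it carries out a mechanistic analysis that uses representational geometry and linear probing to explain model behaviour. The remaining keywords, such as MoE, SLMs, Scaling Laws, RLHF, and RAG, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies over-refusal in aligned large language models, where safe instructions are declined. Its representational-geometry analysis finds that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent, lie within the benign task-representation clusters, and span a higher-dimensional subspace. This explains why global direction ablation alone cannot resolve over-refusal.

Abstract (Chinese translation)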

经过训练以拒绝有害请求的对齐语言模型同样会表现出过度拒绝现象：它们会拒绝那些看似与有害指令相似的良性指令。一种自然的解决思路是消除全局拒绝方向，通过将隐藏状态向量从有害拒绝示例中引导开或向其靠近，但这仅能偶然性地修正过度拒绝，同时会破坏更广泛的拒绝机制。在本研究中，我们通过分析两种拒绝类型的表征几何结构来理解这一现象的原因。我们发现，有害拒绝方向具有任务无关性，可通过单一全局向量捕捉；而过度拒绝方向则具有任务依赖性：它们存在于良性任务表征簇内部，随任务不同而变化，并跨越一个更高维度的子空间。线性探测证实，这两种拒绝类型从Transformer的早期层开始就在表征上存在差异。这些发现从机制上解释了为何仅靠全局方向消除无法解决过度拒绝问题，并表明必须采用针对具体任务的几何干预措施。

Abstract

Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.

Keywords: Aligned Language Models, Over-refusal, Representational Geometry, Task-conditioned Refusal, Mechanistic Analysis, Global Refusal Direction, Task-dependent Directions, Linear Probing
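
The abstract above refers to ablating a global refusal direction. As a rough illustration only, the sketch below shows the standard projection step that "direction ablation" usually denotes: subtracting from a hidden-state vector its component along a chosen direction. The function name `ablate_direction` and the toy vectors are assumptions made for this example; the paper's actual procedure is not described in the abstract.

```python
import math


def ablate_direction(hidden, direction):
    """Return `hidden` with its component along `direction` removed.

    Hypothetical helper for illustration: both arguments are plain lists of
    floats with the same length, and `direction` must be non-zero.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    # Subtract the projected component; the result is orthogonal to `direction`.
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy example: removing one direction leaves the other coordinates untouched.
hidden_state = [1.0, 2.0, 3.0]
refusal_direction = [2.0, 0.0, 0.0]
ablated = ablate_direction(hidden_state, refusal_direction)
```

In this toy case the first coordinate is zeroed while the other two survive, which gives some intuition for the abstract's claim: a single ablated vector cannot remove behaviour that spans a higher-dimensional, task-dependent subspace.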

169. ❌ AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

Authors: Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Runnan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Jiang Yong, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, Jingren Zhou Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27490v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

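Scoring rationale: The paper studies context management for LLMs acting as autonomous agents in long-horizon information seeking, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 each). It explicitly mentions "managing finite context capacity", which relates to "Context Window Extension" (8). It does not cover the specific techniques behind any other keyword.

!!! tip deepseek-chat TL;DR

The paper addresses the bottleneck of managing finite context capacity when LLMs act as long-horizon web agents. It proposes AgentSwing, an adaptive parallel context management routing framework. Experiments show that it matches or exceeds strong static context management methods with up to $3\times$ fewer interaction turns and raises the final performance ceiling.

Abstract (Chinese translation)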

随着大语言模型（LLM）演化为面向长期信息搜索的自主智能体，有限上下文容量的管理已成为关键瓶颈。现有的上下文管理方法通常在整个任务轨迹中采用单一固定策略。此类静态设计在某些状态下可能表现良好，但无法适应长期搜索过程中累积上下文的效用与可靠性动态变化。为系统刻画这一挑战，我们引入了一个概率框架，通过两个互补维度——搜索效率与终端精度——来表征长期任务的成功。基于此视角，我们提出了AgentSwing，一种状态感知的自适应并行上下文管理路由框架。在每个触发点，AgentSwing并行扩展多个经过上下文管理的分支，并利用前瞻路由选择最具潜力的延续路径。在多样化基准测试与智能体骨干模型上的实验表明，AgentSwing始终优于强静态上下文管理方法，通常能以最多$3\times$更少的交互轮次达到或超越其性能，同时提升了长期网络智能体的最终性能上限。除实证效果外，所提出的概率框架为分析与设计面向长期智能体的未来上下文管理策略提供了原则性视角。

Abstract

As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context evolve during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success through two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to $3\times$ fewer interaction turns while also improving the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.

Keywords: LLM agents, context management, long-horizon search, adaptive routing, parallel branches, web agents, probabilistic framework, autonomous agents

170. ❌ A tree interpretation of arc standard dependency derivation

Authors: Zihao Huang, Ai Ka Lee, Jungyeul Park Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27459v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

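Scoring rationale: The paper studies dependency parsing in natural language processing, specifically the theoretical mapping between arc-standard derivations and ordered tree representations, with a proof-of-concept implementation in a neural transition-based parser. All scoring keywords concern large models, their training and inference techniques, or AI-for-Science applications. The paper focuses on parsing theory and uses a neural parser only as a proof of concept, so it matches none of the keywords and every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a theoretical framework that interprets arc-standard dependency derivations as ordered tree representations, proves that a dependency tree admits such a representation if and only if it is projective, and provides a proof of concept in a neural transition-based parser.

Abstract (Chinese translation)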

我们证明，用于投射性依存树的弧标准推导可确定一种具有表层连续产出和稳定词项锚定的唯一有序树表示。每一次 \textsc{shift}、\textsc{leftarc} 和 \textsc{rightarc} 转移均对应一次确定性的树更新操作，且最终生成的层次化对象能唯一确定原始的依存弧。我们进一步证明，该表示刻画了投射性：一个单中心词依存树当且仅当具有投射性时，才允许此类连续有序表示。本方案是推导式而非转换式的，它将弧标准转移序列直接解释为有序树的构建过程，而非将已完成的依存图转换为短语结构输出。对于非投射性输入，在实际应用中可通过推导前的伪投射提升与恢复后的逆向解码，采用相同的解释方式。在一个基于转移的神经句法分析器中进行的原理验证实现表明，映射后的推导是可执行的，并能支持稳定的依存关系恢复。

Abstract

We show that arc-standard derivations for projective dependency trees determine a unique ordered tree representation with surface-contiguous yields and stable lexical anchoring. Each \textsc{shift}, \textsc{leftarc}, and \textsc{rightarc} transition corresponds to a deterministic tree update, and the resulting hierarchical object uniquely determines the original dependency arcs. We further show that this representation characterizes projectivity: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is derivational rather than convertive. It interprets arc-standard transition sequences directly as ordered tree construction, rather than transforming a completed dependency graph into a phrase-structure output. For non-projective inputs, the same interpretation can be used in practice via pseudo-projective lifting before derivation and inverse decoding after recovery. A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery.

Keywords: dependency parsing, arc-standard derivation, projective dependency trees, ordered tree representation, neural transition-based parser, pseudo-projective lifting, syntactic analysis, natural language processing
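
For readers unfamiliar with the arc-standard system named in the abstract above, the following is a minimal sketch of the textbook transition system, not of the paper's tree construction. It assumes 1-based token indices and no artificial ROOT node; the function name `run_arc_standard` and the `span` bookkeeping are hypothetical additions that show how each stack item covers a contiguous stretch of the sentence.

```python
def run_arc_standard(n_tokens, transitions):
    """Execute arc-standard transitions over tokens 1..n_tokens.

    Returns the final stack, the (head, dependent) arcs in creation order,
    and the surface span (first, last) covered by each token's subtree.
    """
    stack = []
    buffer = list(range(1, n_tokens + 1))
    arcs = []
    span = {}
    for action in transitions:
        if action == "shift":
            # Move the next input token onto the stack as a one-word subtree.
            token = buffer.pop(0)
            span[token] = (token, token)
            stack.append(token)
        elif action == "leftarc":
            # The second-from-top item becomes a dependent of the top item.
            dependent = stack.pop(-2)
            head = stack[-1]
            arcs.append((head, dependent))
            span[head] = (span[dependent][0], span[head][1])
        elif action == "rightarc":
            # The top item becomes a dependent of the item beneath it.
            dependent = stack.pop()
            head = stack[-1]
            arcs.append((head, dependent))
            span[head] = (span[head][0], span[dependent][1])
        else:
            raise ValueError(f"unknown transition: {action}")
    return stack, arcs, span


# Toy sentence "She reads books": tokens 1, 2, 3 with "reads" as the head.
final_stack, arcs, span = run_arc_standard(
    3, ["shift", "shift", "leftarc", "shift", "rightarc"]
)
```

Here the derivation ends with token 2 alone on the stack, the arcs (2, 1) and (2, 3), and a subtree for token 2 that spans tokens 1 through 3 without gaps.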

Authors: Jakub Bąba, Jarosław A. Chudziak Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27451v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

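Scoring rationale: The paper studies LLMs in a multi-agent debate framework and directly involves the keywords LLMs, Self-Correction, LLM Agents, and Multi-agent Systems, which receive the highest relevance. Because it improves reasoning through multi-agent debate, it is partly related to Chain of Thought, System 2 Thinking, and Explainable AI. Other keywords such as MoE, SFT, and RAG are not addressed and have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper targets two weaknesses of large language models in argument mining: structural ambiguity and sycophancy during single-agent self-correction. It proposes MAD-ACC, a multi-agent debate framework that applies dialectical refinement through a Proponent-Opponent-Judge model. Without domain-specific training, MAD-ACC reaches a Macro F1 of 85.7% on the UKP Student Essays corpus, outperforming single-agent reasoning baselines, and its debate transcripts make the decisions explainable.

Abstract (Chinese translation)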

论元挖掘（Argument Mining，AM）是自动化写作评估的一项基础技术，但传统的监督方法严重依赖昂贵且领域特定的微调。尽管大语言模型（Large Language Models，LLMs）提供了一种无需训练的替代方案，但它们常常受困于结构模糊性，难以区分类似成分（如主张与前提）。此外，单智能体的自我修正机制往往存在“迎合倾向”，即模型会强化其初始错误而非进行批判性评估。本文提出MAD-ACC（面向论元成分分类的多智能体辩论框架），该框架利用辩证优化来解决分类不确定性。MAD-ACC采用“支持者-反对者-裁判”模型，使智能体针对模糊文本提出相互冲突的解读并进行辩护，从而揭示单智能体模型所忽略的逻辑细节。在UKP学生作文语料库上的评估表明，MAD-ACC的宏观F1分数达到85.7%，显著优于单智能体推理基线方法，且无需领域特定训练。此外，与“黑盒”分类器不同，MAD-ACC的辩证方法通过生成可解释决策逻辑的人类可读辩论记录，提供了一种透明且可解释的替代方案。

Abstract

Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike “black-box” classifiers, MAD-ACC’s dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.

Keywords: Argument Mining, Large Language Models, Multi-Agent Systems, Self-Correction, Dialectical Refinement, Explainable AI, Argument Component Classification, Debate Framework
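
The abstract above reports its result as a Macro F1 score. As a reminder of what that metric measures, here is a small sketch of the usual definition: the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are assumptions for illustration and are not taken from the paper's evaluation code or data.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy example with two component types mentioned in the abstract.
gold_labels = ["Claim", "Premise", "Premise", "Premise"]
pred_labels = ["Claim", "Claim", "Premise", "Premise"]
score = macro_f1(gold_labels, pred_labels)
```

Because every class contributes equally to the average, a rare class that is often confused with a common one (for example Claims mistaken for Premises) lowers Macro F1 more than it would lower plain accuracy.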

172. ❌ Improving Attributed Long-form Question Answering with Intent Awareness

Authors: Xinran Zhao, Aakanksha Naik, Jay DeYoung, Joseph Chee Chang, Jena D. Hwang, Tongshuang Wu, Varsha Kishore Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27435v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

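Scoring rationale: The paper studies LLMs for scientific report generation and improves long-form question answering by strengthening intent awareness. It involves both LLMs and fine-tuning of smaller models and falls within AI for Science. It is highly relevant to LLMs, Small Language Models, Post-training/SFT, and AI for Science; the other keywords are not addressed.

!!! tip deepseek-chat TL;DR

The paper studies how strengthening the intent awareness of large language models improves report quality and citation usage in scientific long-form question answering. Experiments show average gains over baselines of +2.9 absolute points for large models and +12.3 for small models.

Abstract (Chinese translation)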

大型语言模型（LLM）正日益被用于生成全面、知识密集型的报告。然而，尽管这些模型基于多样化的学术论文和报告进行训练，它们并未接触到指导作者撰写这些文档的推理过程与意图。我们假设，增强模型的意图感知能力可以显著提升生成长篇报告的质量。我们开发并采用基于标签的结构化方案，以更好地引导出写作或引用的潜在隐含意图。我们证明，这些提取出的意图不仅能增强LLM的零样本生成能力，还能为微调较小模型创造高质量的合成数据。我们的实验表明，在多项具有挑战性的科学报告生成任务中，模型性能均得到提升，其中大型模型和小型模型相较于基线分别平均提升了+2.9和+12.3个绝对百分点。此外，我们的分析阐明了意图感知如何增强模型的引用使用，并显著提高报告的可读性。

Abstract

Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model’s intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.

Keywords: Large Language Models, Intent Awareness, Long-form Question Answering, Scientific Report Generation, Fine-tuning, Citation Usage, Readability Improvement, Synthetic Data

173. ❌ Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation

Authors: Amir Zeldes, Katherine Conhaim, Lauren Levine Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

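Scoring rationale: The paper studies the annotation of proposition salience in natural language processing, which belongs to text analysis, corpus linguistics, and discourse analysis. It covers extractive summarization, the quantification of proposition salience, Rhetorical Structure Theory (RST), and discourse parsing, but it does not involve large models, AI for Science, or any technique named in the scoring keywords. All keywords concern the principles, training methods, inference optimization, or applications of large models, whereas this paper is an annotation study, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to operationalize graded proposition salience in naturally occurring text. It applies a summarization-based salience metric to a small multi-genre dataset and carries out a preliminary study of how the metric relates to discourse unit centrality in Rhetorical Structure Theory.

Abstract (Chinese translation)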

尽管抽取式摘要研究已有悠久传统——其本质目标在于还原文本中最核心的命题，但在自然语料中实现分级命题显著性的可操作性研究仍十分有限。本文借鉴先前显著实体抽取研究中基于分级摘要的显著性度量方法，将其调整用于量化命题显著性。我们明确了标注任务框架，将其应用于一个小规模多体裁数据集，评估了标注者间一致性，并依据修辞结构理论的话语解析框架，初步探究了该度量指标与话语单元中心性概念之间的关联。

Abstract

Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).

Keywords: proposition salience, extractive summarization, annotation task, rhetorical structure theory, discourse parsing, salient entity extraction, multi-genre dataset, graded salience

174. ❌ Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring

Authors: Jakub Masłowski, Jarosław A. Chudziak Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

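Scoring rationale: The paper studies LLMs as autonomous agents in a multi-agent debate system for ethical tutoring and is highly relevant to "Large Language Models", "Retrieval-Augmented Generation", "LLM Agents", and "Multi-agent Systems" (10 each). It touches on ethical alignment, complex reasoning, and factuality, so it is partly related to "Instruction Tuning/Alignment", "Chain of Thought", "System 2 Thinking", and "Hallucination Mitigation" (5 each). Other keywords, such as MoE, quantization, and AI for Science, are not addressed (0).

!!! tip deepseek-chat TL;DR

The paper addresses the semantic drift and logical deterioration that LLM multi-agent systems show in ethical tutoring. It proposes the Heterogeneous Debate Engine, which combines Identity-Grounded Retrieval-Augmented Generation with Heuristic Theory of Mind. In the evaluation, contrary doctrinal initializations raise students' Argument Complexity Scores by an order of magnitude over baselines.

Abstract (Chinese translation)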

大型语言模型（LLM）正日益作为自主代理被应用于复杂推理任务中，这为辩证互动开辟了特定领域。然而，由系统性无约束系统实现的多智能体系统普遍经历语义漂移与逻辑退化，因此难以应用于需要精确答案的伦理教学场景。当前的模拟往往倾向于退化为辩证停滞，智能体陷入递归性趋同或循环论证。一个关键挑战依然存在：如何在确保教义忠实度的同时，不抑制辩证推理所需的生成灵活性？为应对这一特定需求，我们提出了异构辩论引擎（Heterogeneous Debate Engine, HDE），这是一种认知架构，它结合了基于身份检索增强生成（Identity-Grounded Retrieval-Augmented Generation, ID-RAG）以确保教义忠实度，并利用启发式心智理论（Heuristic Theory of Mind）进行策略性对手建模。评估结果表明，架构异构性是维持稳定性的关键变量：与基线相比，对立的教义初始化（如义务论与功利主义）使学生的论证复杂度评分提升了一个数量级。这些发现验证了ID-RAG与启发式心智理论作为架构要素在维持高保真度（对抗性）教学中的有效性。

Abstract

Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, Multi-Agent systems implemented with systematically unconstrained systems systematically undergo semantic drift and logical deterioration and thus can hardly be used in providing ethical tutoring where a precise answer is required. Current simulation often tends to degenerate into dialectical stagnation, the agents degenerate into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable to stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) have increased the Argument Complexity Scores of students by an order of magnitude, over baselines. These findings validate the effectiveness of ID-RAG and Heuristic ToM as architectural requirements in maintaining high-fidelity (adversarial) pedagogy.

Keywords: Large Language Models, Multi-Agent Systems, Retrieval-Augmented Generation, Ethical Tutoring, Dialectical Reasoning, Cognitive Architecture, Doctrinal Fidelity, Argument Complexity

175. ❌ Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach

Authors: Maziar Kianimoghadam Jouneghani Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

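Scoring rationale: The paper studies culturally adaptive LLM assessment of multilingual information disorder and directly involves LLMs, explainable AI, and in-context learning. It is highly relevant to LLMs (10), because LLMs are the base models; to explainable AI (10), because the focus is explainable LLM assessment; and to in-context learning (10), because it uses ICL with dynamically retrieved exemplars. It is relevant to factuality and hallucination mitigation (8), because it involves detecting information manipulation; to alignment (5), because it evaluates agreement between the model and human evaluators; and to RAG (5), because it uses dynamically retrieved exemplars. Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the monocultural bias and limited explainability of LLMs when assessing information disorder across cultures. It proposes a human-in-the-loop framework that uses in-context learning with dynamically retrieved target-language exemplars. An initial pilot compares static and adaptive prompting on Farsi and Italian news, evaluating span and severity prediction and the quality and cultural appropriateness of the generated rationales.

Abstract (Chinese translation)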

识别信息失序现象是困难的，因为对信息操纵的判断依赖于文化和语言背景。然而当前的大型语言模型（LLMs）往往表现为单一文化、以英语为中心的“黑箱”，其生成的流畅推理常忽视本地化框架。来自多语言信息失序（InDor）语料库的初步证据表明，现有模型难以在不同社群间对操纵性新闻作出连贯解释。为弥补这一缺陷，这项进行中的研究提出一种混合智能循环——一种人在回路（HITL）框架，将模型评估建立在母语标注者撰写的人类推理基础上。该方法通过将英语任务指令与动态检索的目标语言示例相结合，超越了静态的目标语言少样本提示策略：这些示例通过情境学习（ICL）从经过筛选的InDor标注中提取。在初步实验中，示例库以这些筛选后的标注为基础构建，并用于比较波斯语和意大利语新闻的静态提示与自适应提示效果。研究评估了文本片段与严重程度预测、生成推理的质量与文化适应性，以及不同评估组间的模型对齐度，从而为基于文化的可解释人工智能提供了一个测试平台。

Abstract

Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric “black boxes,” producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.

Keywords: Large Language Models, Explainable AI, In-context Learning, Multilingual Information Disorder, Cultural Adaptation, Human-in-the-loop, Hybrid Intelligence Loop, Exemplar Bank

176. ❌ Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

Authors: Amartya Bhattacharya Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

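Scoring rationale: The paper studies compositional reasoning in vision-language models (VLMs), in particular improving performance through structural reasoning. Relevance to the keywords: (1) the paper evaluates models such as Qwen3-VL-8B-Thinking, which extend large language models, so it is partly related to “Large Language Models” (5). (2) Its core aim is better compositional reasoning, which involves multi-step and in-depth reasoning, so it is highly related to “Chain of Thought” and “System 2 Thinking” (8 each). The remaining keywords (MoE, SLMs, Scaling Laws, training methods, alignment, RAG, acceleration techniques, hallucination mitigation, interpretability, model merging, in-context learning, AI for Science, and so on) are not covered or mentioned in the paper and score 0.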
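The arithmetic behind the 0.0 total can be sketched as follows. The pass threshold follows the formula stated in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula `weight * relevance / 10 * max_per_keyword` is an assumption made for illustration, since the report only states a maximum of 15 points per keyword. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass score as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def keyword_score(weight, relevance, max_per_keyword=15.0):
    """ASSUMPTION: linear scaling of a 0-10 relevance to the 15-point maximum.

    The report does not spell out this formula; it is only an illustration.
    """
    return weight * (relevance / 10.0) * max_per_keyword


# Non-zero relevance rows from the table above, with the weight as printed (0.0).
rows = [
    ("Large Language Models", 0.0, 5.0),
    ("Chain of Thought", 0.0, 8.0),
    ("System 2 Thinking", 0.0, 8.0),
]

total_as_reported = sum(keyword_score(w, r) for _, w, r in rows)
# Same rows with the configured weight of 1.0 instead of the printed 0.0.
total_if_weighted = sum(keyword_score(1.0, r) for _, _, r in rows)

print(pass_threshold(27.0), total_as_reported, total_if_weighted)
```

Under any multiplicative scheme of this kind, a weight of 0.0 in every row forces a total of 0.0 regardless of the relevance values, which matches the scores printed in these tables.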

!!! tip deepseek-chat TL;DR

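The paper addresses the weakness of vision-language models on compositional reasoning. It proposes a scene-graph-based structural reasoning framework that uses dependency parsing and graph asymmetry scoring, lifting Qwen3-VL-8B-Thinking to a group score of 66.0 on the Winoground benchmark and surpassing the prior open-source state of the art.

Abstract (Chinese translation)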

视觉语言模型（VLMs）在图像-文本检索任务中表现出色，但在组合推理方面持续存在不足，难以区分那些词汇相同但关系结构不同的描述文本。本文提出了一个统一的评估与增强框架，在Winoground基准测试上对四种架构各异的VLM模型——CLIP、BLIP、LLaVA和Qwen3-VL-8B-Thinking——进行了基准测试，涵盖了原始场景和场景图增强两种模式。我们引入了一种基于依存关系的文本场景图解析器（TextSceneGraphParser，基于spaCy），用于提取“主体-关系-客体”三元组，以及一个使用最优二分匹配的图不对称性评分器，以注入结构关系先验。通过描述文本消融实验（主体-客体掩码与交换）发现，Qwen3-VL-8B-Thinking模型取得了62.75的组别分数，远超所有基于编码器的模型；而提出的多轮场景图过滤策略进一步将其分数提升至66.0，超越了此前开源的先进水平。我们分析了能力增强的权衡关系，发现场景图增强对已有较强能力的模型有益，而对较弱的基线模型则带来可忽略甚至负面的增益。代码地址：https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding

Abstract

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, that is, at distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) extracting subject-relation-object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing the prior open-source state of the art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
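To make the idea of scoring structural asymmetry with bipartite matching concrete, here is a minimal sketch. It is not the paper's Graph Asymmetry Scorer: the slot-wise similarity, the normalization, and all function names are assumptions, and the triples are written by hand rather than extracted with spaCy.

```python
from itertools import permutations


def triple_similarity(a, b):
    """Fraction of (subject, relation, object) slots that are identical."""
    return sum(x == y for x, y in zip(a, b)) / 3.0


def best_matching_score(triples_a, triples_b):
    """Mean similarity under the best one-to-one matching (brute force).

    ASSUMPTION: unmatched triples in the larger set count as similarity 0.
    """
    if not triples_a or not triples_b:
        return 0.0
    small, large = sorted((triples_a, triples_b), key=len)
    best = 0.0
    # Try every injective assignment of the smaller set into the larger one.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(triple_similarity(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


def asymmetry(triples_a, triples_b):
    """0.0 for identical relational structure, larger when roles differ."""
    return 1.0 - best_matching_score(triples_a, triples_b)


# Two captions with the same words but swapped roles (hand-written triples).
caption_a = [("plants", "surround", "lightbulb")]
caption_b = [("lightbulb", "surround", "plants")]
print(asymmetry(caption_a, caption_b))
```

Brute-force enumeration is only practical for the handful of triples a short caption yields. For larger graphs an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.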

Keywords: Vision-Language Models, Compositional Reasoning, Structural Reasoning, Scene Graph, Winoground Benchmark, Qwen3-VL-8B-Thinking, Graph Asymmetry Scorer, Multi-turn Filtering

177. ❌ PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering

Authors: Yiqing Zhang, Xiaozhong Liu, Fabricio Murai Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27335v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

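Scoring rationale: PubMed Reasoner is a GPT-4o-based biomedical question-answering agent whose core contribution is combining retrieval-augmented generation (RAG) with dynamic reasoning mechanisms (self-critic query refinement, reflective retrieval, and multi-step reasoning) to improve the accuracy and trustworthiness of evidence-grounded answers. It is therefore highly relevant (10) to the following keywords: LLMs (uses GPT-4o), RAG (retrieval-augmented generation), CoT Reasoning (multi-step reasoning), System 2 Thinking (in-depth reasoning), Self-Reflection, LLM Agents (agent system), Hallucination Mitigation (fewer hallucinations), and AI for Science (bioinformatics application). Other keywords such as MoE, SLMs, training methods, and model compression are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PubMed Reasoner, a biomedical question-answering agent that combines dynamic reasoning with retrieval-augmented generation. Using self-critic query refinement and reflective retrieval, it reaches 78.32% accuracy on PubMedQA, slightly above human experts, and improves the reasoning soundness and evidence grounding of its answers.

Abstract (Chinese translation)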

可信赖的生物医学问答系统不仅需要提供准确答案，还必须以当前可验证的证据为其提供依据。检索增强方法部分解决了这一缺陷，但缺乏迭代优化低质量查询的机制，而自反思方法仅在完整检索完成后才启动。在此背景下，我们推出PubMed Reasoner——一个由三个阶段组成的生物医学问答智能体：自我批判式查询优化阶段通过评估MeSH术语的覆盖度、对齐度和冗余度，基于部分（元数据）检索结果优化PubMed查询；反思式检索阶段分批处理文献直至收集到充分证据；基于证据的答案生成阶段则生成附带明确引用的回答。以GPT-4o为核心的PubMed Reasoner在PubMedQA数据集上达到78.32%的准确率，小幅超越人类专家，并在MMLU临床知识评估中持续提升。此外，基于大语言模型的评估显示，我们的回答在推理严谨性、证据支撑度、临床相关性和可信度方面均更受青睐。通过构建以权威资源为优先的检索推理机制，我们的方法在为临床医生和生物医学研究者提供实际帮助的同时，有效控制了计算与标记成本。

Abstract

Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and showing consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across: reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
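The three stages described in the abstract can be sketched as a small control loop. This is an illustrative skeleton only: the critic, the batch source, and the sufficiency check are toy stand-ins, every function name is hypothetical, and no real PubMed or LLM API is called.

```python
from typing import Callable, Iterable, List


def refine_query(query: str, critic: Callable[[str], List[str]], max_rounds: int = 3) -> str:
    """Stage 1: let a critic propose missing terms until it has no complaints."""
    for _ in range(max_rounds):
        missing_terms = critic(query)
        if not missing_terms:
            break
        query = query + " " + " ".join(missing_terms)
    return query


def reflective_retrieval(batches: Iterable[List[str]],
                         is_sufficient: Callable[[List[str]], bool]) -> List[str]:
    """Stage 2: consume article batches and stop once evidence is judged sufficient."""
    evidence: List[str] = []
    for batch in batches:
        evidence.extend(batch)
        if is_sufficient(evidence):
            break
    return evidence


def grounded_answer(answer: str, evidence: List[str]) -> str:
    """Stage 3: attach one numbered citation per evidence item."""
    citations = ", ".join(f"[{i}]" for i in range(1, len(evidence) + 1))
    return f"{answer} {citations}".strip()


# Toy stand-ins (assumptions): a critic that wants one term, fixed batches.
def toy_critic(query: str) -> List[str]:
    return [] if "aspirin" in query else ["aspirin"]


toy_batches = [["doc-A", "doc-B"], ["doc-C"], ["doc-D"]]

query = refine_query("headache treatment", toy_critic)
evidence = reflective_retrieval(toy_batches, lambda ev: len(ev) >= 3)
reply = grounded_answer("yes", evidence)
print(query, evidence, reply)
```

The early exit in `reflective_retrieval` is what keeps the fourth batch from being processed, which mirrors the abstract's point about controlling compute and token costs.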

Keywords: biomedical question answering, retrieval-augmented generation, dynamic reasoning, self-criticism, evidence grounding, LLM agents, PubMedQA, clinical knowledge

178. ❌ SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality

Authors: Qinghao Guan, Yuchen Pan, Donghao Li, Zishi Zhang, Yiyang Chen, Lu Li, Flaminia Canu, Emilia Volkart, Gerold Schneider Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

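Scoring rationale: The paper builds the SACRED multimodal dataset and evaluates 13 popular LLMs on classifying online spirituality communication, so it is highly relevant to “Large Language Models” (10). It mentions “fine-tuned approaches”, which is partly related to “Post-training/SFT” (5). The study is an AI application in the social sciences, which is partly related to “AI for Science” (5). The other keywords (such as MoE, Scaling Laws, and RLHF) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study builds SACRED, which it describes as the first annotated multimodal dataset of online spirituality communication, and evaluates 13 large language models on it. DeepSeek-V3 performs best on text classification (79.19% accuracy on the Quora test set) and GPT-4o-mini performs best on the vision tasks (63.99% F1 score).

Abstract (Chinese translation)

在宗教与神学研究中，灵性因其超越文化特性并为个体提供独特体验而受到广泛关注。然而，社会科学研究者常受限于规模较小的数据集，且此类数据基本无法在线获取。本研究与社会科学学者合作，构建了一个高质量多媒体多模态数据集——SACRED，其分类可靠性得到严格保证。基于SACRED数据集，我们评估了13种主流大语言模型以及传统规则方法与微调模型的性能。结果表明，DeepSeek-V3模型在此类抽象概念分类任务中表现优异（在Quora测试集上准确率达79.19%），而GPT-4o-mini模型在视觉任务中超越其他模型（F1分数达63.99%）。据我们所知，这是首个源自在线灵性交流的标注多模态数据集。研究还发现了一种对传播学研究具有价值的新型连接模式。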

Abstract

In religion and theology studies, spirituality has garnered significant research attention for the reason that it not only transcends culture but offers unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The result suggests that the DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy in the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.

Keywords: SACRED dataset, multimedia multimodal dataset, online spirituality, connectedness classification, large language models evaluation, DeepSeek-V3, GPT-4o-mini, social science AI application

179. ❌ Self-evolving AI agents for protein discovery and directed evolution

Authors: Yang Tan, Lingrong Zhang, Mingchen Li, Yuanxi Yu, Bozitao Zhong, Bingxin Zhou, Nanqing Dong, Liang Hong Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27303v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

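Scoring rationale: The paper proposes VenusFactory2, a self-evolving multi-agent framework for protein discovery and directed evolution. Its core involves AI agents, multi-agent systems, self-evolution and self-improvement mechanisms, and AI-for-science applications. It is highly relevant to “Self-Correction OR Self-Improvement OR Self-Reflection” (self-evolution is the core contribution), “LLM Agents OR Autonomous Agents OR Agentic Workflow” (the framework is built on autonomous agents), “Multi-agent Systems OR Agent Coordination” (multi-agent infrastructure), and “AI for Science OR Bioinformatics OR Cheminformatics” (applied to protein science). It is partly related to “Large Language Models OR LLMs OR Foundation Models” and “Tool Use OR Function Calling OR API Tool Use”, since the agents may rely on LLMs and tools, although the paper does not emphasize these technical details. Other keywords such as MoE, quantization, and inference acceleration are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the bottleneck of manually orchestrating information and algorithms in protein scientific discovery. It proposes VenusFactory2, a self-evolving multi-agent framework that shifts from static tool usage to dynamic workflow synthesis, outperforms existing agents on the VenusAgentEval benchmark, and autonomously organizes protein discovery and optimization from a single natural language prompt.

Abstract (Chinese translation)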

蛋白质科学发现受限于信息与算法的人工协调瓶颈，而通用智能体在复杂领域项目中存在不足。VenusFactory2 提出了一种自主框架，通过自进化的多智能体基础设施，将静态工具使用转变为动态工作流合成，以应对蛋白质相关需求。该框架在 VenusAgentEval 基准测试中超越了一系列知名智能体，并能从单一自然语言指令出发，自主组织蛋白质的发现与优化过程。

Abstract

Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects. VenusFactory2 provides an autonomous framework that shifts from static tool usage to dynamic workflow synthesis via a self-evolving multi-agent infrastructure to address protein-related demands. It outperforms a set of well-known agents on the VenusAgentEval benchmark, and autonomously organizes the discovery and optimization of proteins from a single natural language prompt.

Keywords: self-evolving AI agents, protein discovery, directed evolution, multi-agent infrastructure, autonomous framework, workflow synthesis, VenusFactory2, VenusAgentEval

180. ❌ Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

Authors: Zequn Xie, Zhengyang Sun Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

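Scoring rationale: The paper studies hallucination in RAG and proposes the VOTE-RAG framework, which uses a multi-agent voting mechanism. Highly relevant keywords: RAG (core method), Hallucination Mitigation (core problem), LLMs (base models), and LLM Agents and Multi-agent Systems (the framework uses multiple agents). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science are not covered.

!!! tip deepseek-chat TL;DR

The paper targets the “hallucination on hallucination” problem in RAG, where flawed retrieval results mislead the generation model. It proposes VOTE-RAG, a framework whose two-stage multi-agent voting mechanism mitigates hallucination and matches or surpasses more complex frameworks on multiple benchmarks.

Abstract (Chinese translation)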

检索增强生成（Retrieval-Augmented Generation，简称RAG）旨在通过整合外部知识来减少大语言模型（Large Language Models，LLMs）的幻觉现象。然而，RAG引入了一个关键挑战：“幻觉叠加幻觉”，即存在缺陷的检索结果会误导生成模型，导致幻觉问题加剧。为解决这一问题，我们提出了VOTE-RAG——一种无需训练的新型框架，其采用两阶段结构和高效、可并行化的投票机制。VOTE-RAG包含：（1）检索投票阶段：多个智能体并行生成多样化查询，并汇总所有检索到的文档；（2）响应投票阶段：多个智能体基于汇总文档独立生成答案，最终输出由多数投票决定。我们在六个基准数据集上进行了对比实验。结果表明，VOTE-RAG取得了与更复杂框架相当或更优的性能。此外，VOTE-RAG架构更简洁、完全可并行化，并避免了“问题漂移”风险。我们的研究表明，简单可靠的集成投票是一种更优越且高效的缓解RAG幻觉的方法。

Abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: “hallucination on hallucination,” where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the “problem drift” risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
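The two voting stages can be sketched in a few lines. This is a minimal illustration, not the VOTE-RAG implementation: the agents and the retriever are plain callables standing in for LLM and search calls, deduplication of pooled documents is an assumption, and the loops run sequentially even though the paper describes the stages as parallelizable.

```python
from collections import Counter


def retrieval_voting(question, query_agents, retrieve):
    """Stage 1: each agent writes its own query; pool every retrieved document."""
    pooled, seen = [], set()
    for agent in query_agents:
        for doc in retrieve(agent(question)):
            if doc not in seen:  # ASSUMPTION: duplicates are dropped
                seen.add(doc)
                pooled.append(doc)
    return pooled


def response_voting(question, docs, answer_agents):
    """Stage 2: each agent answers independently; the majority answer wins."""
    answers = [agent(question, docs) for agent in answer_agents]
    # On a tie, Counter.most_common keeps the answer that was seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Toy stand-ins (assumptions) for LLM agents and a retriever.
toy_index = {"capital france": ["Paris is the capital of France."],
             "france capital city": ["Paris is the capital of France.",
                                     "Lyon is a city in France."]}
query_agents = [lambda q: "capital france", lambda q: "france capital city"]
answer_agents = [lambda q, d: "Paris", lambda q, d: "Lyon", lambda q, d: "Paris"]

docs = retrieval_voting("What is the capital of France?", query_agents,
                        lambda query: toy_index.get(query, []))
final = response_voting("What is the capital of France?", docs, answer_agents)
print(len(docs), final)
```

Because neither stage depends on another agent's output, each loop could be handed to a thread or process pool without changing the result.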

Keywords: Retrieval-Augmented Generation, Hallucination Mitigation, Large Language Models, Ensemble Voting, Multi-agent Systems, Training-free Framework, Parallelizable Architecture, Benchmark Evaluation

181. ❌ Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus

Authors: Jawid Ahmad Baktash, Mursal Dawodi, Nadira Ahmadi Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

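Scoring rationale: The paper studies structural stress and learned helplessness during the humanitarian crisis in Afghanistan and uses a traditional machine-learning method (TF-IDF + Linear SVM) for multi-label classification. It is unrelated to all of the large-model and deep-learning keywords.

!!! tip deepseek-chat TL;DR

The paper introduces AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari. Its baseline experiments show that character TF-IDF with a Linear SVM outperforms pre-trained language models on stress classification.

Abstract (Chinese translation)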

我们推出首个多标签达里语（东波斯语）自述压力叙事语料库AFSTRESS，该语料库包含737份在持续人道主义危机期间从阿富汗个体收集的反馈。参与者通过达里语清单描述经历的压力，并选择情绪与压力源标签。该数据集支持三个层面的分析：计算层面（多标签分类）、社会层面（结构性驱动因素与性别差异）以及心理层面（习得性无助、慢性压力与情绪级联模式）。数据集包含12个二元标签（5类情绪、7类压力源），具有高标签基数（5.54）和标签密度（0.462），体现了复杂多维的压力特征。结构性压力源占主导地位：未来不确定性（62.6%）与教育中断（60.0%）的比例超过情绪状态标签，表明压力主要由结构性因素驱动。最强的共现关系存在于绝望感与未来不确定性之间（雅卡尔德指数J=0.388）。基线实验表明，采用字符级TF-IDF特征与线性支持向量机（Linear SVM）的组合取得微平均F1值0.663与宏平均F1值0.651，优于ParsBERT和XLM-RoBERTa模型，而阈值调优使微平均F1值提升10.3个百分点。 AFSTRESS为首个用于危机受影响人群压力与福祉计算分析的达里语资源。

Abstract

We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis. Participants describe experienced stress and select emotion and stressor labels via Dari checklists. The dataset enables analysis at three levels: computational (multi-label classification), social (structural drivers and gender disparities), and psychological (learned helplessness, chronic stress, and emotional cascade patterns). It includes 12 binary labels (5 emotions, 7 stressors), with high label cardinality (5.54) and density (0.462), reflecting complex, multi-dimensional stress. Structural stressors dominate: uncertain future (62.6 percent) and education closure (60.0 percent) exceed emotional states, indicating stress is primarily structurally driven. The strongest co-occurrence is between hopelessness and uncertain future (J = 0.388). Baseline experiments show that character TF-IDF with Linear SVM achieves Micro-F1 = 0.663 and Macro-F1 = 0.651, outperforming ParsBERT and XLM-RoBERTa, while threshold tuning improves Micro-F1 by 10.3 points. AFSTRESS provides the first Dari resource for computational analysis of stress and well-being in a crisis-affected population.
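The dataset statistics quoted in the abstract follow standard multi-label definitions: label cardinality is the mean number of positive labels per response, label density is cardinality divided by the number of labels, and J is the Jaccard index between two label columns. The sketch below uses a small made-up label matrix (not AFSTRESS data) and then checks that the reported cardinality and density are consistent with 12 labels.

```python
def label_cardinality(rows):
    """Mean number of positive labels per instance."""
    return sum(sum(row) for row in rows) / len(rows)


def label_density(rows):
    """Cardinality normalized by the number of labels."""
    return label_cardinality(rows) / len(rows[0])


def jaccard(rows, i, j):
    """|both labels set| / |either label set| for label columns i and j."""
    both = sum(1 for row in rows if row[i] and row[j])
    either = sum(1 for row in rows if row[i] or row[j])
    return both / either if either else 0.0


# Toy binary label matrix (hypothetical data): 4 responses, 3 labels.
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]

print(label_cardinality(toy), label_density(toy), jaccard(toy, 0, 1))

# Consistency check on the reported figures: 5.54 labels on average out of 12.
reported_density = round(5.54 / 12, 3)
print(reported_density)
```

The last line reproduces the density of 0.462 given in the abstract, so the two reported numbers agree with each other.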

Keywords: AFSTRESS corpus, Dari language, multi-label classification, structural stress, learned helplessness, humanitarian crisis, TF-IDF, Linear SVM

182. ❌ SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration

Authors: Dongyi Fan, Suqiong Zhang, Lili He, Ming Liu, Yifan Huo Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

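Scoring rationale: SCOPE is a self-correcting online log parsing method that combines heuristics with an LLM. It is highly relevant to LLM applications (10) because it uses the LLM as a fallback for semantic understanding in complex cases. Its self-correcting mechanism is also highly relevant to the Self-Correction keyword (10). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and interpretability are not mentioned in the abstract and are unrelated (0). The paper is a novel application of AI to system log analysis, which could be seen as a science or engineering application in a broad sense, but it does not target a specific scientific field such as bioinformatics, so AI for Science scores 0.

!!! tip deepseek-chat TL;DR

Traditional log parsers are efficient but inaccurate, while LLM-based parsers are accurate but slow. The paper proposes SCOPE, a self-correcting online log parsing method that uses a bi-directional tree structure and a two-stage syntactic-semantic collaboration framework to reduce LLM calls while maintaining high accuracy, balancing efficiency and effectiveness on multiple benchmark datasets.

Abstract (Chinese translation)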

日志解析是复杂系统中自动化日志分析的关键步骤。传统基于启发式的方法效率较高，但由于忽略语义上下文，其准确性受限。相比之下，近期基于大语言模型（LLM）的解析器通过语义理解提升了准确性，但因频繁调用模型而产生高延迟。为解决这一问题，我们提出了SCOPE——首个集成启发式与基于LLM范式优势的自校正在线日志解析方法。SCOPE引入了一种新颖的双向树结构，支持从正向与反向两个维度进行高效的模板匹配，从而获得更高的整体匹配率。此外，该方法采用两阶段语法-语义协作框架：轻量级自然语言处理（NLP）模型首先利用词性（POS）信息进行基于语法的匹配；当存在不确定性时，系统则选择性调用大语言模型作为后备机制，以处理语义复杂的案例。这一设计在保持高准确性的同时显著减少了大语言模型API的调用频率，实现了效率与效能的平衡。在多样化基准数据集上的大量评估表明，SCOPE在准确性与效率方面均优于现有前沿方法。相关实现与数据集已公开发布，以促进进一步研究。

Abstract

Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods offer high efficiency but are limited in accuracy due to overlooking semantic context. In contrast, recent LLM-based parsers improve accuracy via semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both heuristic and LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching from both forward and reverse directions, resulting in a higher overall matching rate. Additionally, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first utilizes part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, achieving a balance between efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.
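The cost-saving idea of trying cheap matching first and calling the LLM only as a fallback can be sketched as below. This is a deliberately simplified illustration: it uses a flat template list instead of SCOPE's bi-directional tree, omits the POS-based stage, and replaces the LLM with a toy function. All names are hypothetical.

```python
WILDCARD = "<*>"


def matches(template, tokens):
    """A template matches when lengths agree and every non-wildcard token is equal."""
    return len(template) == len(tokens) and all(
        slot == WILDCARD or slot == tok for slot, tok in zip(template, tokens)
    )


class FallbackParser:
    """Cheap template matching first; the expensive fallback only on a miss."""

    def __init__(self, llm_fallback):
        self.templates = []           # learned templates, as token lists
        self.llm_fallback = llm_fallback
        self.llm_calls = 0

    def parse(self, line):
        tokens = line.split()
        for template in self.templates:
            if matches(template, tokens):
                return " ".join(template)
        # Miss: ask the (stubbed) LLM for a template and remember it.
        self.llm_calls += 1
        template = self.llm_fallback(tokens)
        self.templates.append(template)
        return " ".join(template)


def toy_llm(tokens):
    """Stand-in for an LLM call: treat any token containing a digit as a variable."""
    return [WILDCARD if any(ch.isdigit() for ch in tok) else tok for tok in tokens]


parser = FallbackParser(toy_llm)
first = parser.parse("Connected to 10.0.0.1")
second = parser.parse("Connected to 10.0.0.2")
print(first, second, parser.llm_calls)
```

In this toy run the second line is served from the cached template, so the fallback is invoked only once for two log lines.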

Keywords: log parsing, self-correcting, online parsing, LLM-based parser, syntactic-semantic collaboration, bi-directional tree, template matching, efficiency-accuracy trade-off

183. ❌ LightMover: Generative Light Movement with Color and Intensity Controls

Authors: Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu, Yannick Hold-Geoffroy, Sumit Chaturvedi, Qi Wu, Zhe Lin, Scott Cohen Source: arxiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

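Scoring rationale: LightMover focuses on controllable light editing in computer vision, using video diffusion priors to manipulate lighting in a single image. It has no direct connection to any of the scoring keywords, which all concern large models, deep-learning techniques, and their scientific applications. The paper does not cover language models, model training or alignment techniques, inference optimization, agent systems, model compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

LightMover is a framework for controllable light editing in single images based on video diffusion priors. It predicts visual token sequences to control light position, color, and intensity independently, and introduces an adaptive token-pruning mechanism that reduces control sequence length by 41% while maintaining editing fidelity.

Abstract (Chinese translation)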

本文提出LightMover框架，用于单幅图像的可控光照操控，该框架利用视频扩散先验，在不重新渲染场景的情况下生成物理上合理的光照变化。我们将光照编辑构建为视觉标记空间中的序列到序列预测问题：给定一幅图像和光照控制标记，模型可从单一视角调整光源位置、颜色与强度，并同步生成相应的反射、阴影与衰减效果。这种对空间（移动）与外观（颜色、强度）控制的统一处理，提升了对光照的操控能力与理解水平。我们进一步引入一种自适应标记剪枝机制，在保留空间信息标记的同时紧凑编码非空间属性，使控制序列长度减少41%且保持编辑保真度。为训练本框架，我们构建了一个可扩展的渲染流程，能够生成大量在不同光源位置、颜色和强度下的图像对，同时保持场景内容与原始图像一致。LightMover能够对光源位置、颜色和强度实现精确且独立的控制，并在不同任务中取得了高PSNR（峰值信噪比）和强大的语义一致性（基于DINO与CLIP指标）。

Abstract

We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.

Keywords: light editing, video diffusion priors, visual token space, adaptive token-pruning, controllable light manipulation, physically plausible illumination, single image, rendering pipeline

184. ❌ daVinci-LLM: Towards the Science of Pretraining

Authors: Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu
Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	15.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

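Scoring rationale: The paper's core topic is the science of pretraining, so it is highly relevant to the "Pre-training" keyword (15 points). It concerns LLM pretraining (10 points) and examines data processing and scaling laws (8 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.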
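
The arithmetic behind the table can be sketched as follows. This is a minimal illustration, assuming that each row's score is weight × relevance and that the pass threshold is 5.0 + 0.8 × total weight, as stated in the report configuration. The function names are illustrative and are not taken from the report generator.

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# The three non-zero relevance values for this paper, with the weights listed in the table.
rows_as_listed = [(0.0, 10.0), (0.0, 8.0), (0.0, 15.0)]
# The same relevance values with the configured weight of 1.0 per keyword.
rows_configured = [(1.0, 10.0), (1.0, 8.0), (1.0, 15.0)]

threshold = pass_threshold(27.0)          # 27 keywords, total weight 27.0
listed = total_score(rows_as_listed)
configured = total_score(rows_configured)

print(f"threshold={threshold:.1f} listed={listed:.1f} configured={configured:.1f}")
```

Under these assumptions, the listed weights of 0.0 give a total of 0.0, while the configured weight of 1.0 per keyword would give 33.0, which is above the 26.6 threshold. The zero weights in the table may therefore be worth checking against the configuration.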

!!! tip deepseek-chat TL;DR

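This paper studies a scientific methodology for pretraining large language models. Using a fully open paradigm, systematic data processing, and more than 200 controlled experiments, it shows how data processing depth, domain dynamics, and evaluation protocols affect the capabilities acquired in pretraining.

Abstract (Chinese translation)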

基础预训练阶段决定了模型的能力上限，因为后训练难以突破预训练所建立的能力基础，然而这一阶段仍处于严重未被充分探索的状态。这源于一种结构性悖论：拥有计算资源的组织在商业压力下运作，阻碍了透明公开；而学术机构虽享有研究自由，却缺乏预训练规模的计算资源。daVinci-LLM 正处在这一未被探索的交汇点上，它融合了工业级的资源与完全的研究自由，以推进预训练科学。我们采用完全开放的范式，将开放性视为科学方法论，公开完整的数据处理流程、全周期训练过程及系统性探索成果。鉴于该领域缺乏系统化的数据处理方法论，我们采用 Data Darwinism 框架——一套从过滤到合成的原则性 L0-L9 分级体系。我们从随机初始化开始，使用两阶段自适应课程学习，在 8T 令牌上训练了一个 30 亿参数模型，逐步从基础能力转向推理密集型增强。通过 200 多项受控消融实验，我们证实：处理深度能系统性提升能力，使其成为与规模扩展同等关键的重要维度；不同领域表现出差异化的饱和动态，需要从比例调整到格式转换的自适应策略；组合平衡可实现针对性强化，同时防止性能崩溃；评估协议的选择如何塑造我们对预训练进展的理解。通过公开完整的探索过程，我们使学术界能够基于我们的发现和系统化方法论，在预训练领域形成累积性科学知识。

Abstract

The foundational pretraining phase determines a model’s capability ceiling, as post-training struggles to overcome capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully-open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy from filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that: processing depth systematically enhances capabilities, establishing it as a critical dimension alongside volume scaling; different domains exhibit distinct saturation dynamics, necessitating adaptive strategies from proportion adjustments to format shifts; compositional balance enables targeted intensification while preventing performance collapse; how evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form accumulative scientific knowledge in pretraining.

Keywords: pretraining, large language models, data processing, scaling laws, systematic methodology, adaptive curriculum, capability enhancement, foundational capabilities

185. ❌ Learning to Predict Future-Aligned Research Proposals with Language Models

Authors: Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji
Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	8.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

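Scoring rationale: The paper's core topic is the use of LLMs to generate scientific research proposals, which falls under AI for Science, so the related keywords score highly. It involves LLM fine-tuning (SFT), alignment, retrieval augmentation (RAG), chain-of-thought reasoning (CoT), agents, and model merging, and these keywords score 8 to 10 points. Other keywords, such as MoE, quantization, and inference acceleration, are not covered and score 0.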

!!! tip deepseek-chat TL;DR

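This paper proposes an approach based on time-sliced scientific forecasting: an LLM generates research proposals, which are evaluated with the Future Alignment Score (FAS). Experiments show that future-aligned fine-tuning improves the proposals' future alignment (up to +10.6% overall FAS), and two implemented proposals yielded practical gains.

Abstract (Chinese translation)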

大语言模型（LLM）正日益被用于辅助研究构思，但评估LLM生成的研究提案质量仍然困难：新颖性和严谨性难以自动衡量，而大规模人工评估成本高昂。我们提出一种可验证的替代方案，将提案生成重新构建为一个时间切片的科学预测问题。给定一个研究问题以及在截止时间前可获取的启发性论文，模型生成一份结构化提案，并通过其是否预测到截止时间后发表论文中出现的研究方向来进行评估。我们将此目标具体化为未来对齐分数（FAS），该分数通过基于检索和LLM的语义评分，对照一个预留的未来文献库计算得出。为训练模型，我们构建了一个时间一致性数据集，包含来自目标论文及其截止时间前引用文献的17,771篇论文，并合成了教导模型识别研究空白和借鉴启发的推理轨迹。在Llama-3.1和Qwen2.5系列模型上的实验表明，未来对齐微调相较于未对齐的基线模型提升了未来对齐性（整体FAS最高提升10.6%），且领域专家的人工评估也证实了提案质量的改进。最后，我们通过使用代码代理实施了两项模型生成的提案，展示了其实际影响：一项新的提示策略在MATH数据集上获得了4.17%的准确率提升，而一项新颖的模型融合方法也取得了持续的改进。

Abstract

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
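
The time-sliced evaluation described in this abstract can be illustrated with a toy sketch. It assumes a bag-of-words cosine similarity as a stand-in for the paper's retrieval and LLM-based semantic scoring, and the records, function names, and `top_k` parameter are all hypothetical. It shows the general shape of the idea, not the authors' FAS implementation.

```python
from collections import Counter
from datetime import date
from math import sqrt


def split_by_cutoff(papers, cutoff):
    """Split (title, published, text) records into pre-cutoff and post-cutoff lists."""
    past = [p for p in papers if p[1] <= cutoff]
    future = [p for p in papers if p[1] > cutoff]
    return past, future


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for an LLM-based semantic scorer."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def toy_alignment_score(proposal: str, future, top_k: int = 2) -> float:
    """Retrieve the top_k most similar held-out future papers and average their similarity."""
    sims = sorted((cosine(proposal, text) for _, _, text in future), reverse=True)
    top = sims[:top_k]
    return sum(top) / len(top) if top else 0.0


# Toy corpus (invented for illustration only).
papers = [
    ("A", date(2023, 5, 1), "prompting strategies for math reasoning"),
    ("B", date(2024, 9, 1), "model merging with weight averaging"),
    ("C", date(2024, 11, 1), "new prompting strategy improves math accuracy"),
]
past, future = split_by_cutoff(papers, cutoff=date(2024, 1, 1))
score = toy_alignment_score("a prompting strategy for math", future)
print(len(past), len(future), round(score, 3))
```

The key design point is that the proposal is scored only against papers published after the cutoff, so the model cannot be rewarded for restating work it was shown as input.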

Keywords: Large Language Models, Research Proposal Generation, Future Alignment Score, Scientific Forecasting, Fine-tuning, Model Merging, AI for Science, LLM Agents

186. ❌ Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models

Authors: Junhyeok Lee, Kyu Sung Choi
Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

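Scoring rationale: The paper's core topic is fairness control in MoE language models. It is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (15 points) because the whole paper centers on MoE architectures, and relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because the models studied are MoE language models. Topics such as fairness and bias mitigation appear in the paper but are not in the keyword list, so all remaining keywords score 0.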

!!! tip deepseek-chat TL;DR

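This paper studies the relationship between routing sensitivity and fairness control in MoE language models. It finds that routing sensitivity exists but is insufficient for effective stereotype intervention, and that where routing-level adjustment does work (OLMoE) it carries a substantial utility cost.

Abstract (Chinese translation)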

专家混合（Mixture-of-Experts, MoE）语言模型在路由层面对人口统计内容普遍敏感，但利用这种敏感性进行公平性控制在结构上存在局限。我们提出了公平感知路由均衡（Fairness-Aware Routing Equilibrium, FARE）诊断框架，旨在探究不同MoE架构中路由层面刻板印象干预的极限。FARE揭示，路由层面的偏好转移要么无法实现（Mixtral、Qwen1.5、Qwen3），要么统计上不稳健（DeepSeekMoE），要么伴随显著的性能代价（OLMoE，CrowS-Pairs下降4.4个百分点，同时TQA下降6.3个百分点）。关键的是，即使在对数似然偏好转移稳健的情况下，这种转移也无法传递到解码生成过程：对两个非零效应模型的扩展评估显示，所有生成指标均无显著结果。组级专家掩码揭示了原因：偏见与核心知识在专家组内深度纠缠。这些发现表明，路由敏感性对于刻板印象控制是必要但不充分的条件，并识别了特定的架构条件，可为未来设计更具可控性的MoE系统提供参考。

Abstract

Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but insufficient for stereotype control, and identify specific architectural conditions that can inform the design of more controllable future MoE systems.

Keywords: Mixture-of-Experts, MoE language models, fairness control, routing sensitivity, stereotype intervention, FARE framework, utility cost, expert masking

187. ❌ Story2Proposal: A Scaffold for Structured Scientific Paper Writing

Authors: Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo, Ziming Wang, Zhiwei Zhang, Tengyue Xu, Gaoge Liu, Zhentao Zhang, Shuo Zhang, Ziqi Wang, Zheng Feng, Yan Luo, Shu Xu, Yongjin Chen, Zhibo Feng, Zhuo Chen, Bruce Yuan, Biao Wu, Harry Wang, Kris Chen
Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27065v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

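Scoring rationale: The paper's core contribution is a multi-agent framework, Story2Proposal, for structured generation of scientific papers, which is directly and highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow" and "Multi-agent Systems/Agent Coordination" (10 points each). It uses GPT, Claude, Gemini, and Qwen as backbones, so it is highly relevant to "Large Language Models/LLMs/Foundation Models" (10 points). Its application, scientific paper writing, falls under "AI for Science" (10 points). The paper gives no technical detail on other keywords such as MoE, SFT, RAG, and CoT, so they score 0.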

!!! tip deepseek-chat TL;DR

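To address structural drift, missing visual elements, and similar problems in manuscripts generated by existing language-model pipelines, this paper proposes Story2Proposal, a contract-governed multi-agent framework in which agents coordinate under a shared visual contract. Across several LLM backbones it improves structural consistency and visual alignment, scoring 6.145 in expert evaluation versus 3.963 for DirectChat.

Abstract (Chinese translation)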

生成科学手稿需要在文档全生命周期中保持叙事推理、实验证据与视觉呈现之间的对齐。现有语言模型生成流程依赖于无约束的文本合成，仅在生成后进行验证，常导致结构漂移、图表缺失及跨章节不一致问题。本文提出Story2Proposal——一个契约驱动的多智能体框架，通过持久共享视觉契约（visual contract）协调运作的智能体，将研究故事转化为结构化手稿。该系统围绕契约状态组织架构师、撰写者、优化器和渲染器智能体，该契约持续追踪章节结构与已注册的视觉元素；同时评估智能体在“生成-评估-适配”循环中提供反馈，在生成过程中动态更新契约。基于Jericho研究数据集衍生的任务实验表明：在GPT、Claude、Gemini和Qwen模型基座上，Story2Proposal获得6.145的专家评估分，较DirectChat的3.963提升2.182分。与结构化生成基线Fars相比，Story2Proposal平均得分5.705优于后者的5.197，表明其在结构一致性与视觉对齐方面具有显著改进。

Abstract

Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
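
A contract state that tracks section structure and registered visual elements, as the abstract describes, might look roughly like the sketch below. The class name, fields, and checks are hypothetical and chosen only for illustration. The abstract does not specify the actual data structure or how agents update it.

```python
from dataclasses import dataclass, field


@dataclass
class VisualContract:
    """Hypothetical contract state: planned sections plus registered visual elements."""
    sections: list = field(default_factory=list)
    visuals: dict = field(default_factory=dict)  # visual id -> section that must reference it

    def register_visual(self, visual_id: str, section: str) -> None:
        """Record that a figure or table must be referenced in the given section."""
        if section not in self.sections:
            raise ValueError(f"unknown section: {section}")
        self.visuals[visual_id] = section

    def violations(self, draft: dict) -> list:
        """Return readable problems found in a draft (section name -> text)."""
        problems = [f"missing section: {s}" for s in self.sections if s not in draft]
        for visual_id, section in self.visuals.items():
            if visual_id not in draft.get(section, ""):
                problems.append(f"{section} does not reference {visual_id}")
        return problems


contract = VisualContract(sections=["Introduction", "Method", "Results"])
contract.register_visual("fig:pipeline", "Method")
contract.register_visual("tab:scores", "Results")

# A partial draft, as a writer agent might produce mid-generation.
draft = {
    "Introduction": "We study structured manuscript generation.",
    "Method": "The pipeline is shown in fig:pipeline.",
}
problems = contract.violations(draft)
print(problems)
```

In a generate-evaluate-adapt loop, an evaluation step could run a check like `violations` after each draft and feed the reported problems back to the writing agents, which is one way to catch missing figures or sections during generation rather than afterwards.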

Keywords: multi-agent framework, scientific manuscript generation, structured generation, visual contract, LLM agents, AI for science, contract-governed, generate-evaluate-adapt loop

188. ❌ ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Authors: Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I. Ross, Daniel Karl I. Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zexue He, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Peter Staar, Luis Lastras, Aude Oliva, Rogerio Feris
Source: arXiv
Published: 2026-03-28
arXiv link: http://arxiv.org/abs/2603.27064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

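Scoring rationale: The paper's main contribution is ChartNet, a multimodal dataset for chart understanding, and a demonstration of its value as large-scale supervision for multimodal (vision-language) models. It relates indirectly to several keywords: (1) "Large Language Models/Foundation Models" (5 points), because the dataset is intended to support foundation model development; (2) "Scaling Laws AND Data Quality" (5 points), because the paper emphasizes data quality and scale; (3) "Pre-training/Domain Adaptation" and "Post-training/SFT" (5 points each), because the dataset is used for fine-tuning; (4) "Chain of Thought" and "System 2 Thinking" (5 points each), because the dataset includes question answering with reasoning; (5) "AI for Science" (5 points), because chart understanding underpins scientific data analysis. Other keywords, such as MoE, SLMs, RLHF, and RAG, are unrelated to the paper's core content and score 0.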

!!! tip deepseek-chat TL;DR

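This work introduces ChartNet, a high-quality, million-scale multimodal dataset (1.5 million chart samples) for improving chart understanding, and shows through fine-tuning experiments that it improves multimodal models' results on chart-understanding benchmarks.

Abstract (Chinese translation)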

理解图表要求模型能够对几何视觉模式、结构化数值数据及自然语言进行联合推理——当前视觉语言模型（VLM）在此能力上仍存在局限。我们推出ChartNet，一个高质量、百万规模的多模态数据集，旨在推进图表解释与推理研究。ChartNet采用新颖的代码引导合成流程，生成了涵盖24种图表类型和6种绘图库的150万个多样化图表样本。每个样本包含五个对齐组件：绘图代码、渲染的图表图像、数据表格、自然语言摘要以及带推理过程的问答，实现了细粒度的跨模态对齐。为全面覆盖图表理解范畴，ChartNet还额外纳入了包含人工标注数据、真实世界数据、安全性与可追溯性验证的专项子集。此外，通过严格的质量过滤流程，确保了图表表征的视觉保真度、语义准确性与多样性。基于ChartNet的微调在多个基准测试中持续提升模型性能，证明了其作为多模态模型大规模监督数据的实用性。作为同类规模最大的开源数据集，ChartNet致力于为开发具有稳健且可泛化数据可视化理解能力的基础模型提供支持。该数据集已公开于https://huggingface.co/datasets/ibm-granite/ChartNet。

Abstract

Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language – a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet
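
The five aligned components per sample can be pictured as a single record. The sketch below is an assumption-laden illustration: the field names, the example values, and the `table_matches_code` check are invented here and do not reflect the published dataset schema or the authors' quality filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSample:
    """One sample with the five components named in the abstract (field names assumed)."""
    plotting_code: str
    chart_image_path: str
    data_table: list          # rows as lists of cells, header row first
    summary: str
    qa_with_reasoning: list   # (question, reasoning, answer) triples


def table_matches_code(sample: ChartSample) -> bool:
    """Cheap alignment check: every numeric table cell also appears in the plotting code."""
    cells = [
        str(cell)
        for row in sample.data_table[1:]
        for cell in row
        if isinstance(cell, (int, float))
    ]
    return all(cell in sample.plotting_code for cell in cells)


# The plotting code is stored as text only; it is not executed in this sketch.
sample = ChartSample(
    plotting_code="import matplotlib.pyplot as plt\nplt.bar(['A', 'B'], [3, 7])\nplt.savefig('chart.png')",
    chart_image_path="chart.png",
    data_table=[["category", "value"], ["A", 3], ["B", 7]],
    summary="Category B is more than twice as large as category A.",
    qa_with_reasoning=[("Which category is larger?", "B has 7 and A has 3, and 7 > 3.", "B")],
)
aligned = table_matches_code(sample)
print(aligned)
```

Keeping the code, table, image, summary, and question answering in one record is what makes cross-modal consistency checks of this kind possible.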

Keywords: ChartNet, multimodal dataset, chart understanding, vision-language models, data synthesis, fine-tuning, reasoning, foundation models

189. ❌ Gen-Searcher: Reinforcing Agentic Search for Image Generation

Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue
Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

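Scoring rationale: Gen-Searcher targets image generation, building a search-augmented agent to overcome image generation models' reliance on frozen internal knowledge. Its core contribution is training an agent that performs multi-hop reasoning and search to gather textual knowledge and reference images for generation. Relevance to the keywords: (1) "LLM Agents/Autonomous Agents/Agentic Workflow" is highly relevant (15 points), because the paper centers on training an image generation agent; (2) "Retrieval-Augmented Generation/RAG/Retrieval-Generation" is highly relevant (10 points), because the agent obtains external knowledge through search; (3) "Chain of Thought/CoT Reasoning/Multi-step Reasoning" is highly relevant (10 points), because the agent performs multi-hop reasoning; (4) "Tool Use/Function Calling/API Tool Use" is highly relevant (10 points), because search is a tool the agent uses; (5) "Post-training/Supervised Fine-tuning/SFT" is highly relevant (10 points), because the training pipeline includes SFT; (6) "Large Language Models/LLMs/Foundation Models" is somewhat relevant (5 points), because foundation models may be involved. Other keywords, such as MoE, quantization, and alignment, are unrelated to the paper and score 0.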

!!! tip deepseek-chat TL;DR

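The paper proposes Gen-Searcher, presented as the first attempt to train a search-augmented image generation agent, which gathers external knowledge through multi-hop reasoning and search. Experiments show it improves Qwen-Image by around 16 points on KnowGen and around 15 points on WISE.

Abstract (Chinese translation)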

近期图像生成模型在生成高保真度与逼真图像方面展现出强大能力，但其根本上受限于冻结的内部知识，因此在需要密集知识或最新信息的现实场景中往往表现不佳。本文提出Gen-Searcher，首次尝试训练一个搜索增强的图像生成智能体，通过执行多跳推理与搜索来收集基于事实生成所需的文本知识及参考图像。为实现这一目标，我们构建了定制化数据流程，并精心整理了两个高质量数据集——Gen-Searcher-SFT-10k与Gen-Searcher-RL-6k，其中包含多样化的搜索密集型提示词及对应的真实合成图像。我们进一步引入KnowGen基准测试集，该基准明确要求基于搜索的外部知识进行图像生成，并从多维度评估模型性能。基于这些资源，我们采用监督微调（SFT）训练Gen-Searcher，随后通过具有双重奖励反馈的智能体强化学习进行优化——该方法结合基于文本和基于图像的奖励机制，为GRPO训练提供更稳定且信息丰富的学习信号。实验表明，Gen-Searcher带来显著性能提升，在KnowGen基准上将Qwen-Image模型性能提高约16分，在WISE基准上提升约15分。我们希望这项工作能为图像生成领域的搜索智能体提供开放基础，并已全面开源相关数据、模型及代码。

Abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
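
The combination of text-based and image-based rewards for GRPO training can be sketched as below. The equal weighting and the example reward values are assumptions, since the abstract does not say how the two signals are combined. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO advantage computation.

```python
from statistics import mean, pstdev


def combined_reward(text_reward: float, image_reward: float, text_weight: float = 0.5) -> float:
    """Blend the two reward signals; the equal weighting is an assumption."""
    return text_weight * text_reward + (1.0 - text_weight) * image_reward


def group_relative_advantages(rewards, eps: float = 1e-8):
    """GRPO-style advantages: normalize each reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: (text-based reward, image-based reward).
rollouts = [(0.9, 0.7), (0.4, 0.6), (0.8, 0.2), (0.1, 0.3)]
rewards = [combined_reward(t, i) for t, i in rollouts]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are computed relative to the other rollouts for the same prompt, a rollout is reinforced only when its blended reward beats the group average, which is one reason a more informative reward signal can stabilize training.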

Keywords: image generation, search agent, multi-hop reasoning, retrieval-augmented generation, agentic reinforcement learning, knowledge-intensive prompts, grounded generation, external knowledge

190. ❌ PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Authors: Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
Source: arXiv
Published: 2026-03-30
arXiv link: http://arxiv.org/abs/2603.28763v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

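Scoring rationale: The paper studies using diffusion models to generate human pose datasets with 3D mesh annotations, which belongs to computer vision and generative AI. It is unrelated to most of the LLM technique keywords. It is somewhat relevant only to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (the abstract mentions using DPO for control alignment) and to "AI for Science OR Bioinformatics OR Cheminformatics" (it can be viewed as an application of AI to scientific data generation). No other keywords are covered.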

!!! tip deepseek-chat TL;DR

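The paper proposes PoseDreamer, a pipeline that uses diffusion models to generate large-scale, high-quality human datasets with 3D mesh annotations, addressing the limited scale, high cost, or limited realism of existing datasets. Models trained on the generated data perform comparably to or better than models trained on real-world and traditional synthetic datasets.

Abstract (Chinese translation)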

由于深度模糊性以及从单目图像标注三维几何的固有困难，获取用于三维人体网格估计的标注数据集具有挑战性。现有数据集要么是真实数据集（具有手动标注的三维几何但规模有限），要么是基于三维引擎渲染的合成数据集（虽能提供精确标注，但存在真实感不足、多样性有限且制作成本高昂的问题）。在本研究中，我们探索了第三条路径：生成式数据。我们提出了PoseDreamer，一种利用扩散模型生成具有三维网格标注的大规模合成数据集的新型流程。该方法将可控图像生成与基于直接偏好优化的控制对齐、课程式难样本挖掘以及多阶段质量过滤相结合。这些组件共同作用，自然地保持了三维标注与生成图像之间的对应关系，同时优先处理具有挑战性的样本以最大化数据集效用。利用PoseDreamer，我们生成了超过50万个高质量合成样本，在图像质量指标上相比基于渲染的数据集提升了76%。使用PoseDreamer训练出的模型，其性能达到甚至超越了基于真实世界和传统合成数据集训练的模型。此外，将PoseDreamer与合成数据集结合使用，能获得比结合真实世界与合成数据集更好的性能，这证明了我们数据集的互补性。我们将公开完整数据集及生成代码。

Abstract

Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.

Keywords: diffusion models, human mesh estimation, synthetic data generation, 3D mesh annotations, Direct Preference Optimization, dataset generation pipeline, photorealistic human data, curriculum-based hard sample mining
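
The abstract names Direct Preference Optimization as the control-alignment step but gives no formula. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair, as usually defined for language-model policies. How PoseDreamer obtains the log-probabilities for a diffusion model is not stated in the abstract and is not covered here; the function name, the argument names, and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_* are log-probabilities under the policy being trained,
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is lower when the policy favours the preferred sample
# more strongly than the reference model does.
good = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
bad = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(good < bad)
```

When the policy and the reference model agree on both samples, the margin is zero and the loss equals log 2, which is a convenient sanity check.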

191. ❌ SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

Authors: Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28760v1

Score: 0.0 / 26.6 ❌
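
The pass mark of 26.6 in the score line above follows from the formula in this report's configuration section: 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each. The sketch below reproduces that arithmetic. The function names and the use of `>=` as the pass condition are assumptions for illustration, not the report generator's actual code.

```python
def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as configured in this report: base + slope * total_weight."""
    return base + slope * total_weight


def is_pass(score: float, total_weight: float) -> bool:
    # Assumption: a paper passes when its score reaches the pass mark.
    return score >= pass_mark(total_weight)


total_weight = 27 * 1.0  # 27 keywords, each with weight 1.0
print(round(pass_mark(total_weight), 1))  # 26.6
print(is_pass(0.0, total_weight))         # False
```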

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

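Scoring rationale: The paper focuses on computer vision, specifically capture and annotation techniques for a 3D hand-object interaction dataset. It does not involve large language models, the underlying principles of deep learning techniques, or any concrete AI for Science application, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes a new marker-less multi-camera system for capturing 3D hand-object interaction data in genuinely in-the-wild environments and releases SHOW3D, which the authors describe as the first large-scale dataset with 3D annotations of hands interacting with objects in diverse real-world environments. The approach reduces the trade-off between environmental realism and 3D annotation accuracy.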

Abstract

Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io

Keywords: 3D hand-object interaction, egocentric computer vision, marker-less multi-camera system, in-the-wild dataset, 3D ground-truth annotation, real-world environments, SHOW3D dataset, environmental realism

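192. ❌ FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement

Authors: Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score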
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

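Scoring rationale: The paper "FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement" focuses on optical flow estimation in computer vision and proposes an architecture based on a hierarchical transformer and optimal transport. Although it uses a transformer, this is an architecture designed for a specific vision task rather than a general-purpose large language model (LLM) or foundation model. Its content is unrelated to all of the scoring keywords (which center on LLM techniques, training methods, inference optimization, alignment, agents, and so on), and it does not involve applications of large models in other domains or innovations in their technical principles. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes FlowIt, a new optical flow estimation architecture that captures global context with a hierarchical transformer, formulates flow initialization as an optimal transport problem, and adds a confidence-guided refinement stage. It reports state-of-the-art results on benchmarks such as Sintel and KITTI.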

Abstract

We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.

Keywords: optical flow estimation, hierarchical transformer, global matching, optimal transport, confidence-guided refinement, state-of-the-art, zero-shot generalization
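
The abstract says flow initialization is formulated as an optimal transport problem that also yields confidence maps, without giving the formulation. To make the idea concrete, the sketch below runs generic entropic optimal transport (Sinkhorn iterations) on a tiny matching-cost matrix and reads off a match and a confidence per source pixel. Everything here is an assumption for illustration: the uniform marginals, the `eps` value, the toy cost matrix, and the confidence definition are not taken from FlowIt.

```python
import math


def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals for a small cost matrix."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def match_with_confidence(plan):
    """For each source row: best target index and its share of the row mass."""
    matches = []
    for row in plan:
        best = max(range(len(row)), key=row.__getitem__)
        matches.append((best, row[best] / sum(row)))
    return matches


# Toy 1-D example: three source pixels, three target pixels,
# cost = squared distance between positions.
cost = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 1.0],
        [4.0, 1.0, 0.0]]
plan = sinkhorn(cost)
print(match_with_confidence(plan))
```

In a flow setting, the matched target index minus the source index would give an initial displacement, and low-confidence rows would be candidates for refinement.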

193. ❌ SonoWorld: From One Image to a 3D Audio-Visual Scene

Authors: Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

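Scoring rationale: SonoWorld focuses on generating a 3D audio-visual scene from a single image, involving computer vision, 3D reconstruction, audio generation, and spatial audio rendering. Although it is an AI application, all scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), and the paper neither uses nor mentions any LLM technique, nor does it involve scientific AI applications such as bioinformatics or cheminformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SonoWorld framework, which tackles generating a 3D audio-visual scene from a single image and produces spatial audio aligned with scene geometry and semantics.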

Abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/

Keywords: 3D audio-visual scene, single image, spatial audio, scene generation, audio rendering, panorama, acoustic learning, source separation
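
The abstract mentions rendering ambisonics for point, areal, and ambient sources but does not describe the encoder or the ambisonic order. As background only, the sketch below encodes a single point-source sample into first-order B-format using the traditional convention in which the omnidirectional W channel is scaled by 1/√2. The function name is hypothetical, and nothing here is taken from SonoWorld's implementation.

```python
import math


def encode_first_order(sample: float, azimuth_rad: float, elevation_rad: float):
    """Encode one mono sample into first-order B-format (W, X, Y, Z).

    Azimuth is measured counter-clockwise from straight ahead,
    elevation upward from the horizontal plane.
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth_rad) * math.cos(elevation_rad)
    y = sample * math.sin(azimuth_rad) * math.cos(elevation_rad)
    z = sample * math.sin(elevation_rad)
    return (w, x, y, z)


# A source straight ahead contributes only to W and X,
# a source directly to the left only to W and Y.
front = encode_first_order(1.0, 0.0, 0.0)
left = encode_first_order(1.0, math.pi / 2, 0.0)
print(front, left)
```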

194. ❌ Pandora: Articulated 3D Scene Graphs from Egocentric Vision

Authors: Alan Yu, Yun Chang, Christopher Xie, Luca Carlone Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28732v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

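Scoring rationale: The paper "Pandora: Articulated 3D Scene Graphs from Egocentric Vision" focuses on robot vision and 3D scene understanding, studying how to use human egocentric data to build 3D scene graphs that include articulated object parts in order to improve robots' mobile manipulation. All scoring keywords relate directly to large models, the principles of deep learning techniques, or applications of AI in science, whereas the paper's core content concerns computer vision, robotics, and 3D reconstruction. It does not cover large-model techniques, training methods, inference optimization, agent systems, or AI for Science. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The work addresses the problem that robots cannot fully explore an environment because of their own limitations. It uses human egocentric data to build 3D scene graphs containing articulated object parts, which improves a robot's ability to perform mobile manipulation tasks.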

Abstract

Robotic mapping systems typically approach building metric-semantic scene representations from the robot’s own sensors and cameras. However, these “first person” maps inherit the robot’s own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to those of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot’s ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.

Keywords: egocentric vision, 3D scene graphs, articulated objects, robotic mapping, mobile manipulation, Project Aria glasses, object dynamics, Boston Dynamics Spot
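
The abstract describes scene graphs that carry articulation models and object-container relationships, and a retrieval task for concealed items. A minimal data-structure sketch of that idea might look like the following. All class names, fields, and the toy scene are hypothetical; the paper's actual representation is not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Articulation:
    joint_type: str                    # e.g. "revolute" (door) or "prismatic" (drawer)
    axis: Tuple[float, float, float]   # joint axis in the map frame


@dataclass
class Node:
    name: str
    articulation: Optional[Articulation] = None
    children: List["Node"] = field(default_factory=list)  # contained objects or parts


def path_to(root: Node, target: str, trail: tuple = ()) -> Optional[List[Node]]:
    """Depth-first search for the chain of containers leading to `target`."""
    trail = trail + (root,)
    if root.name == target:
        return list(trail)
    for child in root.children:
        found = path_to(child, target, trail)
        if found is not None:
            return found
    return None


def parts_to_open(root: Node, target: str) -> List[str]:
    """Articulated containers on the path that must be opened to reach `target`."""
    path = path_to(root, target)
    if path is None:
        return []
    return [node.name for node in path[:-1] if node.articulation is not None]


mug = Node("mug")
drawer = Node("drawer", Articulation("prismatic", (1.0, 0.0, 0.0)), [mug])
cabinet = Node("cabinet", children=[drawer])
room = Node("kitchen", children=[cabinet])
print(parts_to_open(room, "mug"))  # ['drawer']
```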

195. ❌ DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Authors: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

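Scoring rationale: The paper focuses on a lightweight diffusion model (DreamLite, 0.39B parameters) that unifies on-device image generation and editing. It is an application of large-model research in another domain (computer vision) and is novel (described as the first unified on-device diffusion model). Core relevant keywords: 'Small Language Models OR SLMs OR On-device AI' (highly relevant, core content, 10 points); 'Post-training OR Supervised Fine-tuning OR SFT' (highly relevant, core content, 10 points); 'Pre-training OR Continual Pre-training OR Domain Adaptation' (some relevance, 5 points); 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (some relevance, the abstract mentions reinforcement learning, 5 points); 'Quantization OR Model Compression OR Low-bit Weights' (some relevance, concerns making the model lightweight, 5 points); 'Speculative Decoding OR Inference Acceleration' (some relevance, concerns inference acceleration, 5 points); 'In-context Learning OR Many-shot Learning' (some relevance, the abstract mentions in-context spatial concatenation, 5 points). The remaining keywords are unrelated to the paper's topic (diffusion models, image generation and editing) and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes DreamLite, a lightweight unified on-device diffusion model that addresses the fact that existing on-device models support only image generation and lack editing. Using task-progressive joint pretraining, high-quality SFT, and reinforcement learning, it reaches GenEval 0.72 for generation and ImgEdit 4.11 for editing and, after step distillation to 4 denoising steps, generates or edits a 1024 x 1024 image in under 1 s on a Xiaomi 14 smartphone.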

Abstract

Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising processing to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1 s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.

Keywords: diffusion models, on-device, image generation, image editing, unified model, lightweight, pruned U-Net, step distillation
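
The abstract's (target | blank) and (target | source) input configurations can be pictured with a small sketch. It treats a latent as a plain 2D list and concatenates along the width axis. The single-channel layout, the all-zero blank slot, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Optional

Latent = List[List[float]]  # hypothetical single-channel H x W latent


def concat_horizontal(left: Latent, right: Latent) -> Latent:
    """Place two latents side by side along the width axis."""
    if len(left) != len(right):
        raise ValueError("latents must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_input(target: Latent, source: Optional[Latent] = None) -> Latent:
    """Return (target | blank) when no source is given, else (target | source)."""
    height, width = len(target), len(target[0])
    # Assumption: the blank slot is an all-zero latent of the same size.
    blank = [[0.0] * width for _ in range(height)]
    return concat_horizontal(target, source if source is not None else blank)


target = [[1.0, 2.0], [3.0, 4.0]]
source = [[5.0, 6.0], [7.0, 8.0]]
print(build_input(target))          # generation: (target | blank)
print(build_input(target, source))  # editing: (target | source)
```

Keeping the two tasks in one input layout is what lets a single network handle both: only the right-hand slot changes between generation and editing.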

196. ❌ Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim

Authors: Martina Hutter-Mironovova Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28670v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

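Scoring rationale: The paper studies sim-to-real transfer of synthetic data for object detection, using YOLO models deployed on an embedded device. All scoring keywords relate to large language models, innovations in deep learning principles, or applications of AI in science, whereas the paper focuses on object detection in computer vision (fruit detection in particular), synthetic data generation, and embedded deployment optimization. It does not involve any LLM technique, innovation in deep learning principles, or application of AI in scientific fields such as bioinformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The study evaluates how effective synthetic data is for sim-to-real transfer in fruit detection under limited data and embedded deployment constraints. It finds that hybrid training on synthetic and real data comes close to the accuracy of real-only training while reducing manual annotation, and it achieves real-time inference on a Jetson Orin NX.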

Abstract

This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit and different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.

Keywords: synthetic data, sim-to-real transfer, object detection, YOLO, embedded deployment, domain shift, TensorRT optimization, real-time inference

197. ❌ Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure

Authors: Chao Yin, Hongzhe Yue, Qing Han, Difeng Hu, Zhenyu Liang, Fangzhou Lin, Bing Sun, Boyu Wang, Mingkai Li, Wei Yao, Jack C. P. Cheng Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

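Scoring rationale: The paper focuses on creating and benchmarking a 3D point cloud dataset for industrial infrastructure, which falls under computer vision and 3D scene understanding. Its content is unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, alignment techniques, and so on), so they are scored 0. The only slightly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves AI for 3D understanding of industrial facilities (which can be viewed as an engineering-science application). It is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper presents Industrial3D, a large-scale terrestrial LiDAR point cloud dataset of industrial mechanical, electrical, and plumbing (MEP) facilities, and establishes the first cross-paradigm benchmark for this setting. The benchmark exposes a severe domain-shift challenge for existing methods: the best supervised method reaches only 55.74% mIoU, and zero-shot Point-SAM reaches only 15.79%.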

Abstract

Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification – core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79% – a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.

Keywords: Terrestrial LiDAR, Point Cloud Dataset, Industrial Infrastructure, Semantic Segmentation, Cross-paradigm Benchmark, Domain Transfer Challenge, MEP Facilities, 3D Scene Understanding

198. ❌ Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration

Authors: Joanna Wiekiera, Martyna Zur Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

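Scoring rationale: The paper focuses on image restoration in computer vision and proposes a modular, task-decoupled framework built from a CNN classifier and U-Net expert models. It is unrelated to most of the scoring keywords, which cover large language models, alignment, reasoning, agents, and similar topics. The only point of contact is the 'Mixture of Experts' concept: the paper uses several independent expert models (U-Nets) for task-specific restoration and assigns inputs to them through a routing mechanism. This resembles the idea behind MoE but is not an MoE architecture in the strict sense, which usually means a mixture of sparsely activated experts. That keyword therefore receives 5 points (some relevance), and all other keywords receive 0 points (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a modular, task-decoupled image restoration framework in which a diagnostic routing mechanism dynamically assigns each input image to an independent expert model. This avoids task interference, reduces training overhead, and enables efficient multi-degradation restoration on standard hardware.

Abstract translation (Chinese)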

恢复受多种退化类型（如噪声、模糊或不当曝光）影响的图像，仍然是计算机视觉领域的一项重大挑战。尽管当前趋势倾向于复杂的单体式一体化架构，但这些模型常受到任务间负面干扰的影响，且需要在高端计算集群上进行大量联合训练。本文提出一种基于显式诊断路由机制的模块化、任务解耦图像恢复框架。该架构包含一个轻量级卷积神经网络（CNN）分类器，用于评估输入图像并将其动态引导至专用的恢复节点。该框架的一个关键优势在于其模型无关的可扩展性：虽然我们使用三个独立的U-Net专家模型进行演示，但该系统允许集成任何针对特定任务定制的恢复方法。通过隔离重建路径，该框架避免了特征冲突，并显著降低了训练开销。与单体模型不同，在本框架中添加新的退化类型仅需训练单个专家模型并更新路由器，而无需进行完整的系统重训练。实验结果表明，这种计算友好的方法为标准本地硬件上的多退化图像恢复提供了一个可扩展且高效的解决方案。代码将在论文录用后公开。

Abstract

Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
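The routing idea described in the abstract can be sketched roughly as follows. This is a toy illustration only: the brightness-based router, the three stand-in experts, and every name in the block are hypothetical, whereas the paper uses a CNN classifier and U-Net experts.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]          # hypothetical stand-in for an image tensor
Expert = Callable[[Image], Image]


def toy_router(image: Image) -> str:
    """Hypothetical diagnostic classifier: picks a degradation label from mean brightness."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    if mean < 0.25:
        return "underexposed"
    if mean > 0.75:
        return "overexposed"
    return "noise"


def brighten(image: Image) -> Image:
    return [[min(1.0, p + 0.25) for p in row] for row in image]


def darken(image: Image) -> Image:
    return [[max(0.0, p - 0.25) for p in row] for row in image]


def smooth(image: Image) -> Image:
    # Crude denoiser: replace every pixel with its row mean.
    return [[sum(row) / len(row)] * len(row) for row in image]


# Each degradation type maps to one independent expert.
EXPERTS: Dict[str, Expert] = {
    "underexposed": brighten,
    "overexposed": darken,
    "noise": smooth,
}


def restore(image: Image, router: Callable[[Image], str],
            experts: Dict[str, Expert]) -> Tuple[str, Image]:
    """Diagnose the degradation, then run only the matching expert."""
    label = router(image)
    return label, experts[label](image)


label, restored = restore([[0.0, 0.0]], toy_router, EXPERTS)
```

Under this structure, supporting a new degradation type means adding one entry to `EXPERTS` and teaching the router one more label, which mirrors the extensibility claim in the abstract.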

Keywords: image restoration, modular framework, task-decoupled, diagnostic routing, U-Net experts, multi-degradation, computational efficiency, scalable solution

199. ❌ Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Authors: Mih Dinh, SouYoung Jin Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28605v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

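Scoring rationale: The paper studies image privacy protection and uses a large language model (LLM) to generate edit instructions, so it is highly relevant to 'Large Language Models' (8 points). It also fine-tunes the diffusion editors, which has some connection to 'Post-training/SFT' (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization are not mentioned in the abstract and are unrelated to the paper's topic (0 points). The paper sits at the intersection of computer vision and privacy protection and is not a scientific AI application such as bioinformatics (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Unsafe2Safe, an automated pipeline that uses a vision-language model to detect privacy-prone images and a large language model to generate instructions that drive diffusion editing of the sensitive regions. The pipeline anonymizes those regions while preserving the images' utility for downstream tasks.

Abstract translation (Chinese)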

大规模图像数据集常包含可识别或敏感内容，在训练可能记忆并泄露此类信息的模型时会引发隐私风险。我们提出Unsafe2Safe——一个全自动处理流程，能够检测易泄露隐私的图像，并仅通过多模态引导的扩散编辑技术重写其敏感区域。Unsafe2Safe分两个阶段运行：第一阶段使用视觉语言模型（i）检测图像的隐私风险，（ii）生成成对的私有与公共描述文本，分别包含和省略敏感属性，以及（iii）提示大语言模型基于公共描述生成结构化的、身份中立的编辑指令。第二阶段采用指令驱动的扩散编辑器，应用这些双重文本提示，生成既保持全局结构和任务相关语义、又消除私有内容的隐私安全图像。为衡量匿名化质量，我们引入了一套涵盖质量、作弊风险、隐私性和实用性的统一评估体系。在MS-COCO、Caltech101和MIT Indoor67数据集上的实验表明，Unsafe2Safe大幅降低了人脸相似度、文本相似度和人口统计特征可预测性，同时保持下游模型准确率与原始数据训练相当。基于我们自动生成的三元组（私有描述、公共描述、编辑指令）对扩散编辑器进行微调，可进一步提升隐私保护效果和语义保真度。Unsafe2Safe为构建大规模隐私安全数据集提供了一种可扩展、原则性的解决方案，且无需牺牲视觉一致性或下游实用性。

Abstract

Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

Keywords: image anonymization, privacy protection, vision-language model, large language model, diffusion editing, instruction-driven editing, privacy-safe datasets, downstream utility

200. ❌ ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains

Authors: Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, Giorgos Tolias Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

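Scoring rationale: The paper studies image similarity retrieval in computer vision and proposes the ELViS model for cross-domain image similarity. Its content centers entirely on visual similarity, local descriptors, optimal transport, and image retrieval, and has no direct connection to any of the scoring keywords, which cover large language models, deep learning techniques, and scientific AI applications. The paper does not involve language models, model training techniques, inference optimization, AI agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

To address poor cross-domain generalization in image retrieval, the paper proposes ELViS, which computes image similarity from local descriptor correspondences, optimal transport, and a voting mechanism. In cross-domain settings it substantially outperforms existing methods at lower computational cost.

Abstract translation (Chinese)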

大规模实例级训练数据稀缺，因此模型通常在特定领域数据集上进行训练。然而在实际检索场景中，模型必须处理多样化的领域，这使得对未见数据的泛化能力至关重要。我们提出ELViS（图像到图像相似度模型），该模型能有效泛化至未见领域。与传统方法不同，我们的模型在相似度空间而非表征空间中运作，从而促进跨领域迁移。该方法利用局部描述符对应关系，通过最优传输步骤优化其相似度——该步骤采用数据依赖的增益机制以抑制信息量不足的描述符，并通过投票过程将强对应关系聚合为图像级相似度。这一设计注入了强归纳偏置，产生了一个简洁、高效且可解释的模型。为评估泛化性能，我们构建了一个包含地标、艺术品、商品及多领域集合的八数据集基准，并将ELViS作为重排序方法进行评估。实验表明，在领域外场景及平均性能上，ELViS大幅优于现有方法，同时仅需其计算成本的一小部分。代码发布于：https://github.com/pavelsuma/ELViS/

Abstract

Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/
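The pipeline outlined above (local correspondences, an optimal transport refinement, then voting) can be illustrated with a heavily simplified sketch. It is not the ELViS implementation: it uses plain entropic Sinkhorn iterations with uniform marginals, omits the data-dependent gains mentioned in the abstract, and the "above the uniform plan" voting rule is an assumption chosen for illustration. All function names are hypothetical.

```python
import numpy as np


def sinkhorn_plan(sim: np.ndarray, eps: float = 0.1, n_iters: int = 100) -> np.ndarray:
    """Entropic optimal transport plan between uniform marginals (plain Sinkhorn)."""
    n, m = sim.shape
    kernel = np.exp(sim / eps)
    a = np.full(n, 1.0 / n)  # uniform mass on descriptors of image A
    b = np.full(m, 1.0 / m)  # uniform mass on descriptors of image B
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def image_similarity(desc_a: np.ndarray, desc_b: np.ndarray, eps: float = 0.1) -> float:
    """Aggregate strong local correspondences into one image-level score."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # cosine similarity of every descriptor pair
    plan = sinkhorn_plan(sim, eps)
    strong = plan > 1.0 / plan.size    # assumed rule: keep pairs above the uniform plan
    return float(np.sum(plan[strong] * sim[strong]))


# Toy descriptors: three unit vectors, compared with themselves and with an orthogonal set.
set_a = np.eye(6)[:3]
set_c = np.eye(6)[3:]
score_same = image_similarity(set_a, set_a)
score_diff = image_similarity(set_a, set_c)
```

Because the score is built from a pairwise similarity matrix rather than from a single global embedding, the sketch operates in similarity space, which is the property the abstract credits for cross-domain transfer.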

Keywords: image similarity, cross-domain generalization, local descriptors, optimal transport, image retrieval, re-ranking, computational efficiency, visual similarity

201. ❌ ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection

Authors: Haojing Chen, Yutong Li, Zhihang Liu, Tao Tan, Haoyu Bian, Qiuju Ma Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28584v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

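Scoring rationale: The paper focuses on salient object detection in optical remote sensing images (ORSI-SOD) and proposes ORSIFlow, a generative method based on rectified flow. The scoring keywords all concern large language models (LLMs), deep learning techniques, or specific AI application areas such as bioinformatics, whereas this paper addresses a specific computer vision task. It does not involve LLMs, MoE, inference acceleration, alignment, or other large-model techniques, and it is not applied to scientific fields such as bioinformatics or cheminformatics. Only the 'AI for Science' keyword receives 5 points (some relevance), because remote sensing counts as a scientific application in a broad sense; all other keywords are unrelated.

!!! tip deepseek-chat TL;DR

To address challenges such as complex backgrounds and low contrast in salient object detection for optical remote sensing images, the paper proposes ORSIFlow, a rectified-flow-based generative framework that achieves state-of-the-art performance on multiple public benchmarks with significantly improved efficiency.

Abstract translation (Chinese)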

光学遥感图像显著目标检测（ORSI-SOD）由于背景复杂、对比度低、目标形状不规则以及目标尺度变化大，仍然面临挑战。现有的判别式方法直接回归显著图，而近期基于扩散的生成式方法则受限于随机采样和高计算成本。本文提出ORSIFlow，一种显著性引导的整流流框架，将ORSI-SOD重新定义为确定性潜空间流生成问题。ORSIFlow在由冻结变分自编码器构建的紧凑潜空间中执行显著掩码生成，仅需少量步骤即可实现高效推理。为增强显著性感知能力，我们设计了用于全局语义判别的显著特征判别器，以及用于精确边界细化的显著特征校准器。在多个公开基准上的大量实验表明，ORSIFlow以显著提升的效率实现了最先进的性能。代码发布于：https://github.com/Ch3nSir/ORSIFlow。

Abstract

Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: https://github.com/Ch3nSir/ORSIFlow.
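The phrase "deterministic latent flow generation" with "only a few steps" can be illustrated with a toy Euler sampler. This is a sketch under stated assumptions: in the paper the velocity field would be a learned network acting in the latent space of a frozen VAE and guided by the input image, whereas here `straight_velocity` is a hand-written, hypothetical field that moves along a perfectly straight path.

```python
import numpy as np


def euler_sample(x0, velocity, n_steps: int) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


# Toy "latent" start point and target mask (hypothetical values).
start = np.zeros(4)
target = np.array([0.0, 1.0, 1.0, 0.0])


def straight_velocity(x, t):
    # Ideal rectified-flow case: constant velocity along the straight line start -> target.
    return target - start


one_step = euler_sample(start, straight_velocity, n_steps=1)
four_steps = euler_sample(start, straight_velocity, n_steps=4)
```

When the path is straight, one Euler step lands on the same point as many steps, and no random noise is injected during sampling. Both properties are what make this family of samplers cheap and deterministic compared with stochastic diffusion sampling.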

Keywords: Optical Remote Sensing Image Salient Object Detection, ORSIFlow, rectified flow, saliency-guided, latent flow generation, variational autoencoder, Salient Feature Discriminator, efficient inference

202. ❌ XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Authors: Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

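Scoring rationale: The paper studies adversarial attacks on vision-language models (VLMs), which belongs to computer vision and multimodal security rather than to innovations in large language models (LLMs) or deep learning techniques. The scoring keywords all revolve around LLMs and their training methods, inference optimization, alignment, and applications, and have no direct connection to the paper's topic. The paper does not involve any of the techniques or concepts in the keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an X-shaped sparse pixel adversarial attack on vision-language models. Experiments show that even extremely sparse visual perturbations can substantially disrupt cross-task semantic understanding in multimodal systems, revealing a robustness gap in current models.

Abstract translation (Chinese)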

视觉-语言模型（Vision-Language Models, VLMs）依赖于共享的视觉-文本表征空间来执行零样本分类、图像描述生成和视觉问答（Visual Question Answering, VQA）等任务。尽管这种共享空间赋予了模型强大的跨任务泛化能力，但它也可能引入一种共同的脆弱性：微小的视觉扰动可通过共享嵌入空间传播，导致不同任务间出现语义层面的关联性失效。这种风险在交互式与决策支持场景中尤为重要，然而目前尚不清楚VLMs是否对高度受限、稀疏且几何结构固定的扰动具有鲁棒性。为探究此问题，我们提出X形稀疏像素攻击（X-shaped Sparse Pixel Attack, XSPA），这是一种难以察觉的结构化攻击方法，将扰动限制在两条交叉的对角线上。与稠密扰动或灵活局部贴片攻击相比，XSPA在更严格的攻击预算下操作，从而为VLM鲁棒性提供了更严苛的测试。在此稀疏支撑域内，XSPA联合优化分类目标、跨任务语义引导以及扰动幅度与沿线平滑度的正则化，在保持视觉隐蔽性的同时，诱导可迁移的错误分类，并在描述生成与VQA任务中引发语义漂移。在默认设置下，XSPA仅修改约1.76%的图像像素。在COCO数据集上的实验表明，XSPA能持续降低所有三项任务的性能：在OpenAI CLIP ViT-L/14模型上零样本准确率下降52.33个百分点，在OpenCLIP ViT-B/16模型上下降67.00个百分点；基于GPT-4评估的描述一致性最多降低58.60个百分点，VQA正确率最多下降44.38个百分点。这些结果表明，即使具有固定几何先验、高度稀疏且视觉隐蔽的扰动，也能显著破坏VLMs中的跨任务语义理解，揭示了当前多模态系统中存在的显著鲁棒性缺陷。

Abstract

Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
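The geometric constraint at the heart of the attack, perturbations confined to two intersecting diagonals, can be sketched as a boolean mask. This illustrates only the support region, not the optimization itself; `x_mask`, `apply_sparse_perturbation`, the `half_width` parameter, and the 224-pixel resolution are assumptions made for the example.

```python
import numpy as np


def x_mask(size: int, half_width: int = 0) -> np.ndarray:
    """Boolean mask that is True on the two diagonals of a size x size image."""
    rows, cols = np.indices((size, size))
    main_diag = np.abs(rows - cols) <= half_width
    anti_diag = np.abs(rows + cols - (size - 1)) <= half_width
    return main_diag | anti_diag


def apply_sparse_perturbation(image: np.ndarray, delta: np.ndarray,
                              mask: np.ndarray) -> np.ndarray:
    """Add delta only on masked pixels and keep values in [0, 1]."""
    return np.clip(image + delta * mask[..., None], 0.0, 1.0)


mask = x_mask(224)                 # assumed input resolution
fraction_modified = mask.mean()    # share of pixels the attack may touch

image = np.zeros((4, 4, 3))
delta = np.full((4, 4, 3), 0.5)
perturbed = apply_sparse_perturbation(image, delta, x_mask(4))
```

The share of modified pixels depends on the image resolution and line thickness, neither of which the abstract specifies, so this sketch is not expected to reproduce the reported figure of about 1.76%.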

Keywords: Vision-language models, Adversarial attacks, Sparse perturbations, Transferable attacks, Multimodal robustness, Cross-task generalization, X-shaped perturbations, Semantic drift

203. ❌ StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Authors: Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian Mo, Qihua Xiao, XiaoShuai Peng, Qingmin Liao, Yu Wang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

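Scoring rationale: StreamingVLA focuses on improving the efficiency of vision-language-action (VLA) models, using asynchronous parallelization, action flow matching, and an adaptive observation mechanism to address latency and halting. Although VLA models are multimodal AI, the paper's core is system optimization and parallel computation rather than large language model (LLM) technology itself. The scoring keywords all revolve around LLM techniques, training methods, inference optimization, alignment, and agent systems, and have no direct connection to the paper's VLA system optimization topic. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes StreamingVLA, which uses action flow matching and an adaptive observation mechanism to parallelize VLA stages asynchronously, achieving a 2.4× latency speedup and a 6.5× reduction in execution halting without sacrificing performance.

Abstract translation (Chinese)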

视觉-语言-动作（Vision-Language-Action, VLA）模型在自然语言驱动的感知与控制任务中展现出卓越性能。然而，VLA模型的高计算成本带来了显著的效率挑战，尤其在实际部署于资源受限的边缘平台时。由于VLA的不同阶段（观察、动作生成与执行）必须顺序执行，并需等待前一阶段完成，系统常面临频繁停顿与高延迟问题。为解决此问题，我们进行了系统性分析，以识别实现快速流畅生成的挑战，并提出让VLA具备以“流式”方式在阶段间异步并行化的能力。首先，我们消除了对动作分块的依赖，采用动作流匹配方法，该方法学习动作流的轨迹而非逐块去噪动作，从而实现了动作生成与执行延迟的重叠。其次，我们设计了一种动作显著性感知的自适应观察机制，进而重叠了执行与观察的延迟。在不牺牲性能的前提下，StreamingVLA实现了显著的加速，并提升了执行流畅度。其延迟加速比达到2.4倍，并将执行停顿减少了6.5倍。

Abstract

Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since the different stages of VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a “streaming” manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. This overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution. It achieves a 2.4× latency speedup and reduces execution halting by 6.5×.
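Why overlapping stages helps can be seen from a back-of-the-envelope latency model. The sketch below compares a strictly sequential control cycle with an idealized, fully overlapped pipeline; the stage latencies are hypothetical numbers chosen for illustration, not measurements from the paper, and the model ignores synchronization overhead.

```python
def sequential_cycle(t_observe: float, t_generate: float, t_execute: float) -> float:
    """Time per control cycle when each stage waits for the previous one."""
    return t_observe + t_generate + t_execute


def overlapped_cycle(t_observe: float, t_generate: float, t_execute: float) -> float:
    """Steady-state time per cycle when all three stages run concurrently (ideal pipeline)."""
    return max(t_observe, t_generate, t_execute)


# Hypothetical stage latencies in milliseconds.
stages = dict(t_observe=30.0, t_generate=60.0, t_execute=50.0)

sequential_ms = sequential_cycle(**stages)
overlapped_ms = overlapped_cycle(**stages)
ideal_speedup = sequential_ms / overlapped_ms
```

In this idealized model the slowest stage sets the cycle time, so the achievable speedup is bounded by the ratio of the summed latencies to the largest single latency.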

Keywords: StreamingVLA, Vision-Language-Action, action flow matching, adaptive observation, latency reduction, asynchronous parallelization, execution fluency

204. ❌ Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy

Authors: Nivetha Jayakumar, Jonathan Pan, Shuo Wang, Bishow Paudel, Nisha Hosadurg, Cristiane C. Singulane, Sivam Bhatt, Amit R. Patel, Miaomiao Zhang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28560v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

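Scoring rationale: The paper focuses on medical image segmentation (myocardial scar segmentation) and uses a curriculum learning framework to improve segmentation performance. Its core is the application of deep learning to biomedical image analysis, which falls under AI for Science (AI4Science), so it has some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). However, the paper does not involve large language models (LLMs), architectural innovations (such as MoE or sparse models), training techniques (such as pre-training, fine-tuning, alignment, or RLHF), inference optimization (such as quantization or acceleration), agent systems, interpretability, or any of the other listed large-model keywords. All other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes a curriculum-learning-based framework for myocardial scar segmentation whose progressive training strategy improves accuracy and robustness when segmenting ambiguous or diffuse scars in challenging LGE-CMR images.

Abstract translation (Chinese)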

心肌瘢痕的识别与量化对心血管疾病的诊断与预后评估至关重要。然而，基于钆延迟增强心脏磁共振（LGE-CMR）图像实现可靠的瘢痕分割仍面临挑战，其原因包括患者间对比增强程度存在差异、成像条件欠佳（如对比剂洗脱效应），以及观察者间差异导致的弥漫性瘢痕金标准标注不一致。本研究提出一种基于课程学习（curriculum learning）的框架，旨在提升模型在此类挑战性条件下的分割性能。该方法采用渐进式训练策略，引导模型从高置信度、边界清晰的瘢痕区域逐步学习至低置信度或视觉模糊、瘢痕负荷较低的样本。通过这种结构化学习过程，网络能够增强对不确定标签及传统训练流程中常被忽略的细微瘢痕表现的鲁棒性。实验结果表明，所提方法显著提升了分割的准确性与一致性，尤其在瘢痕最小化或弥漫性瘢痕病例中表现优于标准训练基线。该策略为在临床应用中利用不完美数据改进心肌瘢痕量化提供了理论依据。我们的代码已在GitHub上公开。

Abstract

Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post-contrast washout, and inconsistencies in ground truth annotations on diffuse scars caused by inter-observer variability. In this work, we propose a curriculum learning-based framework designed to improve segmentation performance under these challenging conditions. The method introduces a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low-confidence or visually ambiguous samples with limited scar burden. By structuring the learning process in this manner, the network develops robustness to uncertain labels and subtle scar appearances that are often underrepresented in conventional training pipelines. Experimental results show that the proposed approach enhances segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines. This strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications. Our code is publicly available on GitHub.
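The progressive training strategy can be sketched as a schedule over training samples ordered from easy to hard. This is a minimal illustration under assumptions: the abstract does not say how confidence is measured or whether earlier samples stay in later stages, so the per-case confidence scores and the cumulative staging below are hypothetical choices.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(samples: Sequence[T],
                      confidence: Callable[[T], float],
                      n_stages: int) -> List[List[T]]:
    """Return cumulative training subsets, easiest (highest confidence) first."""
    ordered = sorted(samples, key=confidence, reverse=True)
    stages = []
    for stage in range(1, n_stages + 1):
        # Each stage keeps all earlier samples and adds the next-hardest slice.
        cutoff = math.ceil(len(ordered) * stage / n_stages)
        stages.append(ordered[:cutoff])
    return stages


# Hypothetical (case id, annotation confidence) pairs.
cases = [("a", 0.9), ("b", 0.2), ("c", 0.6), ("d", 0.4)]
stages = curriculum_stages(cases, confidence=lambda case: case[1], n_stages=2)
stage_ids = [[case_id for case_id, _ in stage] for stage in stages]
```

A training loop would then run one or more epochs per stage, so the network sees clearly defined scars first and the ambiguous, low-burden cases only once the easier patterns are learned.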

Keywords: myocardial scar segmentation, curriculum learning, LGE-CMR, cardiac magnetic resonance, deep learning, medical image analysis, cardiovascular disease

205. ❌ MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures

Authors: Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Ahmed Nassar, Peter Staar Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28550v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

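Scoring rationale: The paper focuses on multimodal recognition of Markush structures in chemical documents, which falls under AI for Science (specifically cheminformatics) and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The other keywords mainly concern large-model fundamentals, training methods, inference optimization, and agent systems; the paper does not mention these techniques, so they all score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the limited accuracy of multimodal Markush structure recognition in chemical documents. It proposes MarkushGrapher-2, an end-to-end method that fuses text, image, and layout information to substantially improve recognition performance, and it releases a large-scale dataset and a benchmark.

Abstract (Chinese translation)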

从文档中自动提取化学结构对于化学文献的大规模分析至关重要。目前已有自动化流程被开发出来，可分别识别图像或文本中表示的分子。然而，从多模态描述（马库什结构）中识别化学结构的方法在精度上相对滞后，无法用于自动化的大规模处理。在本工作中，我们提出了MarkushGrapher-2，一种用于文档中化学结构多模态识别的端到端方法。首先，我们的方法采用专用的OCR模型从化学图像中提取文本。其次，通过一个视觉-文本-布局编码器和一个光学化学结构识别视觉编码器，对文本、图像和布局信息进行联合编码。最后，通过两阶段训练策略有效融合生成的编码，并用于自回归地生成马库什结构的表示。针对训练数据缺乏的问题，我们引入了一个自动化流程来构建大规模的真实世界马库什结构数据集。此外，我们提出了IP5-M——一个大规模人工标注的真实世界马库什结构基准数据集，旨在推动这一挑战性任务的研究。大量实验表明，我们的方法在多模态马库什结构识别方面显著优于现有最先进的模型，同时在分子结构识别方面保持强劲性能。代码、模型和数据集均已公开发布。

Abstract

Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.

Keywords: chemical structure recognition, multimodal recognition, Markush structures, vision-text-layout encoder, optical chemical structure recognition, end-to-end approach, dataset construction, benchmark evaluation

206. ❌ Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow

Authors: Quan Meng, Yujin Chen, Lei Li, Matthias Nießner, Angela Dai Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28548v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

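Scoring rationale: The paper "Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow" focuses on 3D scene completion and generation, using flow matching and a sparse transformer to handle incomplete real-world 3D scans. All scoring keywords relate directly to large language models, deep learning fundamentals, or applications of AI in science, whereas this paper studies 3D computer vision and geometry processing. It involves no large language models, no innovation in deep learning fundamentals, and no application of AI to sciences such as biology or chemistry, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

This paper proposes Seen2Scene, a visibility-guided flow matching method that is the first to train directly on incomplete real-world 3D scans, enabling realistic 3D scene completion and generation for complex, cluttered real environments. Experiments show that it outperforms baselines in completion accuracy and generation quality.

Abstract (Chinese translation)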

本文提出Seen2Scene，这是首个基于流匹配、直接在真实世界不完整三维扫描数据上进行训练的场景补全与生成方法。与先前依赖完整（因而通常是合成）三维数据的方法不同，我们的方法引入了可见性引导的流匹配，该方法显式地掩蔽真实扫描中的未知区域，从而能够从真实世界的不完整观测中进行有效学习。我们使用稀疏网格编码的截断有符号距离场（TSDF）体素来表示三维场景，并采用稀疏Transformer来高效建模复杂场景结构，同时掩蔽未知区域。我们以三维布局框作为输入条件信号，且该方法可灵活适配其他多种输入，如文本或部分扫描。通过直接从真实世界不完整三维扫描中学习，Seen2Scene能够为复杂、杂乱的真实环境实现逼真的三维场景补全。实验表明，我们的模型能够生成连贯、完整且逼真的三维场景，在补全准确性和生成质量上均优于基线方法。

Abstract

We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.

Keywords: 3D scene completion, flow matching, visibility-guided, real-world 3D scans, sparse transformer, TSDF volumes, scene generation, partial observations
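
The idea of masking unknown regions during flow matching can be sketched as a loss restricted to observed voxels. The sketch assumes the common straight-line path x_t = (1 - t)·x0 + t·x1 with target velocity x1 - x0; the abstract does not state which path or loss Seen2Scene uses, and `predict_velocity` is a hypothetical stand-in for the network.

```python
def masked_flow_matching_loss(noise, data, visible, t, predict_velocity):
    """Mean squared velocity error over observed voxels only.

    `visible` is a binary mask: 1 = observed voxel, 0 = unknown region.
    """
    if not (len(noise) == len(data) == len(visible)):
        raise ValueError("noise, data, and visible must have the same length")
    # Point on the straight path from noise (t = 0) to data (t = 1).
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    predicted = predict_velocity(x_t, t)
    errors = [
        (p - v) ** 2
        for p, v, m in zip(predicted, target, visible)
        if m  # unknown voxels (mask 0) contribute nothing to the loss
    ]
    return sum(errors) / len(errors) if errors else 0.0


def zero_model(x_t, t):
    """Placeholder predictor that always outputs zero velocity."""
    return [0.0] * len(x_t)


noise = [0.0, 1.0, -1.0, 0.5]
data = [1.0, 1.0, 0.0, 9.0]  # the last voxel is unobserved, so its value is arbitrary
visible = [1, 1, 1, 0]
loss = masked_flow_matching_loss(noise, data, visible, 0.5, zero_model)
```

Because the last voxel is masked out, changing its value in `data` leaves the loss unchanged, which is what lets training proceed on partial scans without inventing ground truth for unseen regions.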

207. ❌ GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Authors: Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28547v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

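Scoring rationale: The paper focuses on developing a benchmark and evaluation methods for image editing, specifically: (1) building the GEditBench v2 image editing benchmark; (2) developing the PVC-Judge visual consistency assessment model; (3) creating the VCReward-Bench evaluation dataset. The content revolves entirely around image editing tasks in computer vision and does not involve large language models, deep learning fundamentals, scientific AI applications, or related technical keywords. All scoring keywords relate to large language models, deep learning techniques, or scientific AI applications, whereas this paper is purely image editing research, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

To address the limitations of existing image editing evaluation frameworks, this paper proposes the GEditBench v2 comprehensive benchmark and the PVC-Judge visual consistency assessment model. Experiments show that PVC-Judge achieves the best evaluation performance among open-source models and surpasses GPT-5.1 on average, providing a reliable evaluation foundation for precise image editing.

Abstract (Chinese translation)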

图像编辑领域的最新进展使得模型能够处理复杂指令并实现令人印象深刻的真实感。然而，现有的评估框架却相对滞后：当前基准测试存在任务覆盖范围狭窄的问题，而标准指标未能充分捕捉视觉一致性，即编辑后图像与原始图像之间在身份、结构和语义连贯性上的保持。为应对这些局限，我们推出GEditBench v2——一个包含1,200个真实用户查询的综合基准测试集，涵盖23类编辑任务，其中特别设置了开放集类别，用于评估预定义任务之外的、无约束的分布外编辑指令。此外，我们提出了PVC-Judge，一个用于评估视觉一致性的开源成对评估模型，该模型通过两种新颖的区域解耦偏好数据合成流程进行训练。同时，我们利用专家标注的偏好对构建了VCReward-Bench，用以评估PVC-Judge在视觉一致性评判方面与人类判断的契合度。实验表明，我们的PVC-Judge在开源模型中达到了最先进的评估性能，平均表现甚至超越了GPT-5.1。最后，通过对16个前沿编辑模型进行基准测试，我们证明GEditBench v2能够实现更符合人类感知的评估，揭示当前模型的关键局限，并为推进精确图像编辑技术提供可靠基础。

Abstract

Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.

Keywords: image editing, benchmark evaluation, visual consistency, GEditBench v2, PVC-Judge, preference assessment, human-aligned evaluation, open-source model
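
Alignment between a pairwise judge and human annotators is often summarized as the fraction of pairs on which both prefer the same candidate. The sketch below shows that computation on made-up labels. The abstract does not state which metric VCReward-Bench uses, so the agreement rate here is an assumption.

```python
def pairwise_agreement(judge_choices, human_choices):
    """Fraction of comparison pairs where judge and human pick the same side."""
    if len(judge_choices) != len(human_choices):
        raise ValueError("both lists must cover the same pairs")
    if not judge_choices:
        raise ValueError("at least one pair is required")
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(judge_choices)


# Toy preferences over four image pairs ("A" or "B" is the preferred edit).
judge = ["A", "B", "A", "A"]
human = ["A", "B", "B", "A"]
agreement = pairwise_agreement(judge, human)
```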

208. ❌ ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

Authors: Yu Sun, Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan, Runze Xu, Liang Ma, Roy Gan, Andy Zhai, Qingxuan Chen, Zunnan Xu, Hao Wang, Jincheng Yu, Lucy Liang, Qian Wang, Ivan Laptev, Ian D Reid, Xiaodan Liang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

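Scoring rationale: The paper mainly concerns an evaluation framework for robot manipulation and is unrelated to most large-model keywords. It is highly relevant only to "World Models AND General World Models" (explicitly mentioned in the abstract) and somewhat relevant to "Chain of Thought" and "System 2 Thinking" (the paper emphasizes reasoning-oriented tasks). No other keywords are covered.

!!! tip deepseek-chat TL;DR

To address the lack of real-world evaluation standards for Vision-Language-Action models and world models in robot manipulation, this paper proposes ManipArena, a standardized evaluation framework with diverse tasks and real-to-sim environments that supports fair, reproducible evaluation.

Abstract (Chinese translation)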

视觉-语言-动作模型与世界模型近期已成为通用机器人智能领域的重要范式，但其发展受限于缺乏能反映真实世界部署需求的可靠评估方案。现有基准大多以仿真环境为核心，虽具备可控性，却难以捕捉由感知噪声、复杂接触动力学、硬件限制及系统延迟导致的现实差距。此外，分散在不同机器人平台上的真实世界评估阻碍了公平且可复现的比较。为应对这些挑战，我们提出ManipArena——一个旨在连接仿真与真实执行的标准化评估框架。该框架包含20类多样化任务及10,812条专家演示轨迹，重点关注需要语义与空间推理的认知型操作任务；通过受控的分布外场景设置支持多层次泛化能力评估，并涵盖桌面场景之外的长时程移动操作任务。该框架进一步提供丰富的传感诊断数据（包括底层运动信号），以及通过高质量三维扫描构建的同步虚实环境。这些特性共同为视觉-语言-动作模型与世界模型方法提供了公平、贴近现实且可复现的评估体系，为具身智能系统的诊断与推进奠定了可扩展的基础。

Abstract

Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric, which provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks across 10,812 expert trajectories emphasizing reasoning-oriented manipulation tasks requiring semantic and spatial reasoning, supports multi-level generalization through controlled out-of-distribution settings, and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.

Keywords: Vision-Language-Action models, world models, robotic manipulation, evaluation framework, real-world deployment, reasoning-oriented tasks, ManipArena, embodied intelligence

209. ❌ Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs

Authors: Jin Bai, Huiyao Zhang, Qi Wen, Ningyang Li, Shengyang Li, Atta ur Rahman, Xiaolin Tian Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

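Scoring rationale: The paper studies thin-structure segmentation based on State-Space Models (SSMs) and proposes the FGOS-Net framework to resolve a geometry mismatch. All scoring keywords explicitly target large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, etc.), whereas this paper focuses on SSMs for image segmentation in computer vision and has no direct connection to LLM techniques. Although SSMs are also used in NLP, this paper sits entirely within CV and involves no LLM-related content.

!!! tip deepseek-chat TL;DR

This paper addresses the geometry mismatch of State-Space Models in thin-structure segmentation. It proposes FGOS-Net, whose frequency-aware anisotropic serialization significantly improves performance on multiple benchmarks.

Abstract (Chinese translation)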

细长线性结构的分割本质上具有拓扑敏感性，局部微小误差即可破坏长程连通性。尽管近期状态空间模型（SSMs）能实现高效的长程建模，但其各向同性的序列化方式（如光栅扫描）与各向异性目标存在几何失配，导致状态传播跨越而非沿着结构轨迹进行。为解决此问题，我们提出FGOS-Net——一个基于频率-几何解耦的框架。我们首先将特征分解为稳定的拓扑载体和方向性高频分量，利用后者显式校正下采样导致的空间错位。在此校准拓扑的基础上，我们提出频率对齐扫描策略，将序列化提升为几何条件决策过程，从而保持方向一致的轨迹追踪。结合主动探测策略以选择性注入高频细节并抑制纹理歧义，FGOS-Net在四个具有挑战性的基准测试中均稳定超越现有强基线模型。值得注意的是，该模型在DeepCrack数据集上实现了91.3%的mIoU和97.1%的clDice，同时以80 FPS的运行速度仅需7.87 GFLOPs计算量。

Abstract

The segmentation of thin linear structures is inherently topology-critical, where minor local errors can sever long-range connectivity. While recent State-Space Models (SSMs) offer efficient long-range modeling, their isotropic serialization (e.g., raster scanning) creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along the structure trajectories. To address this, we propose FGOS-Net, a framework based on frequency-geometric disentanglement. We first decompose features into a stable topology carrier and directional high-frequency bands, leveraging the latter to explicitly correct spatial misalignments induced by downsampling. Building on this calibrated topology, we introduce frequency-aligned scanning that elevates serialization to a geometry-conditioned decision, preserving direction-consistent traces. Coupled with an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity, FGOS-Net consistently outperforms strong baselines across four challenging benchmarks. Notably, it achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.

Keywords: thin-structure segmentation, State-Space Models, geometry mismatch, frequency-geometric disentanglement, anisotropic serialization, FGOS-Net, topology preservation, computational efficiency
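
The geometry mismatch named in the abstract can be seen in a toy example. Under raster (row-major) scanning, vertically adjacent pixels of a thin vertical structure sit a full image width apart in the serialized sequence, while a column-major scan keeps them adjacent. The sketch only illustrates the mismatch; it is not FGOS-Net's frequency-aligned scanning, whose details the abstract does not give, and the function names and grid size are illustrative.

```python
def raster_index(row, col, width):
    """Position of a pixel in a row-major (raster) scan."""
    return row * width + col


def column_major_index(row, col, height):
    """Position of a pixel in a column-major scan."""
    return col * height + row


def max_gap(positions):
    """Largest sequence distance between consecutive pixels of a structure."""
    return max(abs(b - a) for a, b in zip(positions, positions[1:]))


height, width = 4, 6
vertical_line = [(row, 2) for row in range(height)]  # thin structure in column 2

raster_positions = [raster_index(r, c, width) for r, c in vertical_line]
column_positions = [column_major_index(r, c, height) for r, c in vertical_line]

raster_gap = max_gap(raster_positions)  # neighbours are `width` steps apart
column_gap = max_gap(column_positions)  # neighbours stay adjacent
```

For a horizontal structure the situation reverses, which is why a single fixed scan order cannot suit every orientation.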

210. ❌ ConceptWeaver: Weaving Disentangled Concepts with Flow

Authors: Jintao Chen, Aiming Hao, Xiaoqing Chen, Chengyu Bai, Chubin Chen, Yanxun Li, Jiahong Wu, Xiangxiang Chu, Shanghang Zhang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

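Scoring rationale: The paper studies concept disentanglement and compositional editing based on flow-based models. Although it belongs to the generative modeling field, all scoring keywords explicitly target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, etc.), and this paper does not involve language models or text processing at all, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

This paper proposes the ConceptWeaver framework, which analyzes the three-stage nature of the flow model generation process to disentangle concepts from a single reference image and perform high-fidelity compositional editing.

Abstract (Chinese translation)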

预训练的流模型在合成复杂场景方面表现出色，但缺乏从单次真实世界来源中解耦和定制其底层概念的直接机制。为阐明这一过程，我们首先引入一种新颖的差分探测技术，用于分离和分析随时间推移单个概念标记对速度场的影响。这项研究揭示了一个关键发现：生成过程并非单一整体，而是分三个不同阶段展开。初始的蓝图阶段建立低频结构，随后是关键性的实例化阶段——在此阶段内容概念以峰值强度涌现并自然解耦，为操作创造了最佳窗口。最终的概念不敏感细化阶段则合成细粒度细节。基于这一发现，我们提出了ConceptWeaver，一个用于单次概念解耦的框架。ConceptWeaver通过一种与三阶段框架对齐的阶段感知优化策略，从单张参考图像中学习特定概念的语义偏移。这些学习到的偏移随后在推理过程中通过我们新颖的ConceptWeaver引导机制进行部署，该机制策略性地将其注入到适当的生成阶段。大量实验验证表明，ConceptWeaver能够实现高保真度的组合式合成与编辑，证明理解并利用流模型内在的阶段性本质，是实现精确、多粒度内容操控的关键。

Abstract

Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial Blueprint Stage establishes low-frequency structure, followed by a pivotal Instantiation Stage where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose ConceptWeaver, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.

Keywords: flow-based models, concept disentanglement, one-shot learning, generative process, stage-aware optimization, compositional synthesis, content manipulation, ConceptWeaver Guidance

211. ❌ INSID3: Training-Free In-Context Segmentation with DINOv3

Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28480v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

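Scoring rationale: INSID3 focuses on in-context segmentation in computer vision using DINOv3 self-supervised visual features and is entirely unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, alignment, agents, etc.). The only relevant keyword is "In-context Learning OR Many-shot Learning": the paper's core is in-context segmentation, an application of in-context learning to vision tasks, hence 10 points. No other keywords are covered, so they score 0.

!!! tip deepseek-chat TL;DR

This paper proposes INSID3, a training-free in-context segmentation method that uses only frozen DINOv3 features. It achieves state-of-the-art performance on one-shot semantic, part, and personalized segmentation while using fewer parameters and requiring no supervision.

Abstract (Chinese translation)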

上下文分割（In-context segmentation, ICS）旨在根据一个已标注的视觉示例，分割任意概念，例如物体、部件或个性化实例。现有研究主要依赖两种方法：（i）对视觉基础模型（Vision Foundation Models, VFMs）进行微调，这能提升域内效果但损害泛化能力；或（ii）结合多个冻结的VFMs，虽能保持泛化性，却导致架构复杂且分割粒度固定。本文从极简视角重新审视ICS，并提出：能否仅凭单一自监督骨干网络，在无需任何监督或辅助模型的情况下，同时支持语义匹配与分割？我们证明，DINOv3 生成的规模化密集自监督特征具有强烈的空间结构和语义对应性。我们提出 INSID3，一种无需训练的方法，仅基于冻结的 DINOv3 特征，在给定上下文示例的情况下，实现不同粒度的概念分割。INSID3 在一次性语义分割、部件分割和个性化分割任务中均达到最先进水平，平均交并比（mIoU）较先前工作提升 +7.5%，同时参数量减少三倍，且无需任何掩码或类别级监督。代码发布于 https://github.com/visinf/INSID3。

Abstract

In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.

Keywords: In-context segmentation, DINOv3, self-supervised learning, training-free, one-shot segmentation, vision foundation models, semantic matching, frozen features
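
A training-free, feature-matching approach to in-context segmentation can be sketched with a generic prototype-matching baseline. This is not the INSID3 algorithm, which the abstract does not detail. The sketch averages the features of annotated reference patches into a prototype, then labels target patches by cosine similarity. The 2-D feature vectors and the 0.8 threshold are made up; in practice the features would come from a frozen backbone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prototype(features, mask):
    """Mean feature of the reference patches selected by the binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("reference mask selects no patches")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def segment(target_features, proto, threshold):
    """Mark each target patch as foreground if it matches the prototype."""
    return [1 if cosine(f, proto) >= threshold else 0 for f in target_features]


# Toy patch features: one annotated reference image and one target image.
reference_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
reference_mask = [1, 1, 0]
target_features = [[0.8, 0.2], [0.1, 0.9], [1.0, 0.1]]

proto = prototype(reference_features, reference_mask)
predicted_mask = segment(target_features, proto, threshold=0.8)
```

No parameters are updated anywhere in this pipeline, which is the sense in which such matching is "training-free".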

212. ❌ Post-hoc Self-explanation of CNNs

Authors: Ahcène Boubekki, Line H. Clemmensen Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究卷积神经网络（CNN）的事后自解释方法，通过k-means分类器和特征激活组合生成概念解释图。论文核心是CNN的可解释性技术，与大多数关键词（涉及大模型、训练方法、推理优化、代理系统等）完全无关。仅与’Mechanistic Interpretability OR Explainable AI’高度相关（10分），因为论文直接研究CNN的解释性方法，属于可解释AI范畴。其他关键词均未涉及，故评0分。

!!! tip deepseek-chat TL;DR

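The paper proposes a post-hoc self-explanation method for CNNs based on a k-means classifier and combinations of convolutional feature activations. It generates concept-based explanation maps while preserving predictive performance, and experiments with ResNet34 show that shallower feature activations trade semantic fidelity against predictive performance.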

Abstract (Chinese translation)

尽管标准卷积神经网络（CNN）在数学上可被重新解释为自解释模型（SEM），但其内置原型本身无法准确表征数据。将最终线性层替换为基于$k$-means的分类器可在不影响性能的前提下解决这一局限。本研究针对分类器、编码器最终输出（B4）以及中间特征激活的组合，提出了基于$k$-means的事后解释的统一形式化框架。后一种方法利用卷积感受野的空间一致性生成基于概念的解释图，并通过无梯度特征归因图提供支持。基于ResNet34的实证评估表明，使用较浅层、压缩程度较低的特征激活（如最后三个模块B234的特征）会在语义保真度与轻微预测性能下降之间产生权衡。

Abstract

Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a $k$-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of $k$-means-based post-hoc explanations for the classifier, the encoder’s final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), results in a trade-off between semantic fidelity and a slight reduction in predictive performance.

Keywords: Convolutional Neural Networks, Self-Explainable Models, k-means classifier, post-hoc explanations, concept-based explanation maps, feature attribution maps, ResNet34, semantic fidelity
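
As a rough illustration of the idea in the abstract above (replacing a final linear layer with a $k$-means-based classifier), the sketch below classifies a feature vector by its nearest centroid. It is not the authors' implementation: the function names, the per-class use of k-means, and the toy 2-D features are all assumptions made for illustration, and a real encoder would supply high-dimensional features.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

def fit_prototypes(features_by_class, k_per_class=1):
    """Hypothetical helper: run k-means per class and label each centroid."""
    prototypes = []
    for label, feats in features_by_class.items():
        for centroid in kmeans(feats, k_per_class):
            prototypes.append((centroid, label))
    return prototypes

def predict(prototypes, feature):
    """Return the label of the nearest prototype (Euclidean distance)."""
    _, label = min(prototypes, key=lambda pl: math.dist(feature, pl[0]))
    return label

# Toy encoder outputs (assumed values, two classes in 2-D).
features = {
    "cat": [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1)],
    "dog": [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9)],
}
prototypes = fit_prototypes(features)
print(predict(prototypes, (0.05, 0.0)))
```

Because each prediction is tied to a centroid computed from data, the prototype itself can be inspected, which is the property the abstract contrasts with the built-in prototypes of a linear layer.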

213. ❌ Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation

Authors: Shramana Dey, Varun Ajith, Abhirup Banerjee, Sushmita Mitra Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28463v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

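Scoring rationale: The paper addresses single-source domain generalization in medical image segmentation and proposes WaveSDG, a deep network built on wavelet sub-band decomposition. Its core techniques are wavelet transforms, feature decoupling, and segmentation network design from computer vision and image processing, which are unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the work applies AI to biomedical image analysis (fundus image segmentation) and so falls under AI for Science; since that is not its core contribution, it receives 5 points (some relevance).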

!!! tip deepseek-chat TL;DR

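The study tackles single-source domain generalization in fundus image segmentation, where device differences degrade performance, and proposes WaveSDG, a wavelet-guided decoupling segmentation network. In experiments on multiple unseen target datasets it outperforms seven state-of-the-art methods, with higher accuracy and cross-domain stability.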

Abstract (Chinese translation)

眼底成像领域的域泛化因设备与临床环境采集条件差异而面临挑战。深度学习模型难以适应这些变化，导致其在未见域上性能下降。此外，跨域获取标注数据成本高昂，且隐私限制进一步制约了数据可用性。尽管单源域泛化为该问题提供了现实解决方案，但现有方法往往未能有效捕捉解剖拓扑结构或将外观特征与解剖特征解耦。本研究提出WaveSDG，一种新型小波引导的单源域泛化分割网络。该网络通过小波子带分解将解剖结构与域特异性外观特征解耦。我们提出了一种新颖的基于小波的不变结构提取与优化模块，通过利用各小波子带的差异化语义角色处理编码器特征。该模块优化低频分量以锚定全局解剖结构，同时选择性增强高频子带中的方向性边缘并抑制噪声。大量消融实验验证了该模块及其解耦策略的有效性。我们在一个源数据集和五个未见目标数据集上对视杯与视盘分割进行评估，结果表明WaveSDG在七种前沿方法中持续表现最优。值得注意的是，该方法以更低的方差取得了最佳平衡Dice分数和最低95%豪斯多夫距离，显示出更高的准确性、鲁棒性与跨域稳定性。

Abstract

Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations causes performance degradation on unseen domains for deep learning models. Besides, obtaining annotated data across domains is often expensive and privacy constraints restrict their availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, the existing approaches frequently fail to capture anatomical topology or decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods. Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.

Keywords: Domain Generalization, Fundus Image Segmentation, Wavelet Sub-band Decomposition, Single-source Domain Generalization, Anatomical Structure Decoupling, WaveSDG, Optic Cup/Disc Segmentation, Cross-domain Robustness
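
To make "wavelet sub-band decomposition" concrete, here is a minimal single-level 2-D Haar transform in plain Python. The abstract does not say which wavelet WaveSDG uses, so Haar is an assumption chosen for simplicity, and the function and band names are hypothetical. The sketch splits a grid into one low-frequency approximation band and three high-frequency detail bands, the kind of split the abstract describes as separating global anatomy from edges and noise.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of an even-sized grid (list of rows).

    Returns four half-resolution sub-bands:
    approx (low frequency), d_cols, d_rows, d_diag (high-frequency details).
    """
    rows, cols = len(image), len(image[0])
    assert rows % 2 == 0 and cols % 2 == 0, "sketch assumes even dimensions"
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, g_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = image[r][c], image[r][c + 1]          # top-left, top-right
            bl, br = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)  # local average
            c_row.append((tl - tr + bl - br) / 2)  # change across columns
            r_row.append((tl + tr - bl - br) / 2)  # change across rows
            g_row.append((tl - tr - bl + br) / 2)  # diagonal change
        approx.append(a_row)
        d_cols.append(c_row)
        d_rows.append(r_row)
        d_diag.append(g_row)
    return approx, d_cols, d_rows, d_diag

# Toy 4x4 "image": flat blocks on top, a horizontal edge in the bottom half.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [2, 2, 2, 2],
    [4, 4, 4, 4],
]
approx, d_cols, d_rows, d_diag = haar_dwt2(image)
print(approx)  # coarse structure
print(d_rows)  # non-zero only where values change between rows
```

In this toy input only the row-detail band is non-zero, and only in the bottom half, which shows how an edge ends up isolated in a high-frequency band while the approximation band keeps the coarse layout.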

214. ❌ $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation

Authors: Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28460v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

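Scoring rationale: The paper focuses on distillation of diffusion models. It proposes a new paradigm that reconceptualizes distribution matching as a reward and introduces Group Normalized Distribution Matching (GNDM) to stabilize optimization. Although it draws on reinforcement learning (RL) concepts, its core is diffusion model generation and distillation rather than large language models (LLMs) or new deep learning principles. All keywords concern LLMs, their training methods, inference optimization, application areas, or specific techniques (MoE, RAG, CoT, and so on) and have no direct connection to this diffusion work, so every keyword has relevance 0.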

!!! tip deepseek-chat TL;DR

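The paper proposes a new paradigm that reconceptualizes distribution matching as a reward ($R_{dm}$) and introduces Group Normalized Distribution Matching (GNDM) to stabilize optimization in diffusion distillation, improving generation quality and efficiency while reducing the number of sampling steps.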

Abstract (Chinese translation)

扩散模型虽能实现最先进的生成性能，但其缓慢的迭代采样过程从根本上构成了瓶颈。尽管扩散蒸馏技术能够实现高保真的少步生成，但传统目标函数往往将学生模型的性能仅锚定于教师模型，从而限制了其表现。近期研究尝试通过引入强化学习（RL）来突破这一上限，通常采用蒸馏目标与RL目标的简单叠加方式。本文提出一种新范式，将分布匹配重新定义为一种奖励信号，记为$R_{dm}$。这一统一视角弥合了扩散匹配蒸馏（DMD）与强化学习之间的算法鸿沟，并带来以下关键优势：（1）增强的优化稳定性：我们提出组归一化分布匹配（GNDM），通过将标准RL中的组归一化技术适配于$R_{dm}$估计以提升稳定性。GNDM利用组均值统计量，建立起更鲁棒有效的优化方向。（2）无缝的奖励集成：我们以奖励为核心的公式天然支持自适应加权机制，能够灵活地将DMD与外部奖励模型相结合。（3）提升的采样效率：该框架通过与强化学习原则对齐，可自然融入重要性采样（IS），从而显著提高采样效率。大量实验表明，GNDM优于原始DMD方法，将FID指标降低了1.87。此外，我们的多奖励变体GNDMR在美学质量与保真度之间取得了优异平衡，达到了30.37的峰值HPS分数和12.21的低FID-SD值，超越了现有基线方法。总体而言，$R_{dm}$为实时高保真合成提供了一个灵活、稳定且高效的框架。代码将在论文发表后开源。

Abstract

Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student’s performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.

Keywords: diffusion models, diffusion distillation, distribution matching, reinforcement learning, GNDM, sampling efficiency, generative performance, real-time synthesis
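
The "standard RL group normalization" that the abstract says GNDM adapts can be sketched generically: rewards for a group of samples are centred on the group mean and scaled by the group spread. This is only the textbook form, not the paper's estimator. The abstract mentions group-mean statistics but does not say whether GNDM also divides by the standard deviation, so that step, the function name, and the `eps` parameter are assumptions.

```python
import statistics

def group_normalize(rewards, eps=1e-8):
    """Centre rewards on the group mean and scale by the group std.

    `eps` (an assumed safeguard) avoids division by zero when all
    rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy rewards for one group of samples drawn from the same prompt or noise batch.
advantages = group_normalize([1.0, 2.0, 3.0])
print(advantages)  # below-average samples are negative, above-average positive
```

Centring on the group mean makes the signal relative: a sample is pushed up only if it beats its peers, which is one common explanation for why this normalization reduces variance in the update direction.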

215. ❌ Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching

Authors: Weiguang Zhao, Junting Dong, Rui Zhang, Kailin Li, Qin Zhao, Kaizhu Huang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

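Scoring rationale: The paper studies robot teleoperation and dynamic object catching. Its core is a shared-autonomy framework, a diffusion policy, and geometry-aware decision making; it does not involve large language models, deep learning principles, or AI-for-science applications, and is unrelated to all scoring keywords.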

!!! tip deepseek-chat TL;DR

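The paper proposes the Tele-Catch framework, which uses a dynamics-aware adaptive integration mechanism and a geometry-aware diffusion policy to improve the accuracy and robustness of dexterous hand teleoperation in dynamic object catching.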

Abstract (Chinese translation)

遥操作是将人类灵巧性迁移至机器人的关键范式，然而现有研究大多针对初始静止的物体，如抓取或操控任务。动态物体捕捉——即物体在接触前处于运动状态的任务——仍未得到充分探索。在此类任务中，纯遥操作常因时机、位姿和力控误差而失败，这凸显了需要将人类输入与自主策略相结合的共享自主控制方法。为此，我们提出了Tele-Catch，一个用于动态物体捕捉的灵巧手遥操作系统框架。其核心是设计了DAIM（动态感知自适应集成机制），该机制通过将基于数据手套的遥操作信号融合至扩散策略的去噪过程中，实现了共享自主控制。它能根据交互物体的状态自适应地调节控制。为提升策略鲁棒性，我们提出了DP-U3R，该方法将来自点云观测的无监督几何表征整合到扩散策略学习中，实现了几何感知的决策。大量实验表明，Tele-Catch在动态捕捉任务中显著提升了准确性与鲁棒性，同时在不同灵巧手实体及先前未见过的物体类别上也表现出一致的性能提升。

Abstract

Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets objects that are initially static, such as grasping or manipulation. Dynamic object catch, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human input with autonomous policies. To this end, we present Tele-Catch, a systematic framework for dexterous hand teleoperation in dynamic object catching. At its core, we design DAIM, a dynamics-aware adaptive integration mechanism that realizes shared autonomy by fusing glove-based teleoperation signals into the diffusion policy denoising process. It adaptively modulates control based on the interaction object state. To improve policy robustness, we introduce DP-U3R, which integrates unsupervised geometric representations from point cloud observations into diffusion policy learning, enabling geometry-aware decision making. Extensive experiments demonstrate that Tele-Catch significantly improves accuracy and robustness in dynamic catching tasks, while also exhibiting consistent gains across distinct dexterous hand embodiments and previously unseen object categories.

Keywords: teleoperation, dynamic object catching, shared autonomy, diffusion policy, dexterous hand, adaptive integration, geometry-aware decision making, robustness

216. ❌ From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera

Authors: Victoria Leonenkova, Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

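Scoring rationale: The paper studies Digital-Physical Adversarial Attacks (DiPA) on camera-based authentication systems. It belongs to computer vision and security and is unrelated to all scoring keywords, which center on large models, deep learning principles, and their scientific applications. It does not involve large models, language models, training methods, inference techniques, alignment, compression, agent systems, or AI-for-science applications.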

!!! tip deepseek-chat TL;DR

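The paper proposes DiPA, a digital-physical adversarial attack on camera-based authentication systems that displays an adversarial patch on a smartphone screen to disrupt face recognition in real time. The authors report a higher success rate and greater feature-space distortion than existing physical attacks.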

Abstract (Chinese translation)

本演示提出数字-物理对抗攻击（DiPA），这是一类针对普适性摄像头认证系统的新型实用对抗攻击方法。攻击者无需依赖打印载体，而是直接在智能手机屏幕上显示对抗性补丁。这种纯数字化的物理呈现方式实现了快速部署，消除了全变分正则化的需求，并提升了黑盒条件下的补丁可迁移性。DiPA通过集成前沿人脸识别模型（包括ArcFace、MagFace、CosFace）来增强对未知商业系统的跨模型迁移能力。我们的交互式演示实时展示了针对已部署人脸识别摄像头的躲避攻击：当参与者动态调整补丁图案时，授权用户无法被系统识别，同时攻击对传感流程的即时影响可被直观观测。我们进一步验证了DiPA在攻击成功率、特征空间扭曲度和检测置信度降低等维度上优于现有物理攻击方法，从而揭示了移动设备、普适视觉与传感器驱动认证基础设施交叉领域存在的关键安全漏洞。

Abstract

This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA’s superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.

Keywords: adversarial attacks, face recognition, camera authentication, digital-physical attacks, transferability, real-time attack, security vulnerability, mobile devices

217. ❌ Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation

Authors: Weichao Cai, Weiliang Huang, Biao Xue, Chao Huang, Fei Yuan, Bob Zhang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

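Scoring rationale: The paper belongs to computer vision. It studies infrared-visible image fusion and segmentation and uses a multi-task learning framework to handle image degradation in maritime scenes. It covers image processing, multimodal fusion, and semantic segmentation, which are conventional computer vision tasks, and touches none of the topics behind the keywords (large language models, new deep learning principles, AI for Science). Every keyword therefore has relevance 0.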

!!! tip deepseek-chat TL;DR

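The paper addresses the degradation of infrared-visible images caused by fog and strong reflections in maritime environments. It proposes a multi-task complementary learning framework that jointly performs image restoration, multimodal fusion, and semantic segmentation, and reports state-of-the-art segmentation performance on the authors' own dataset.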

Abstract (Chinese translation)

海洋场景理解与分割在海事监测与航行安全中起着至关重要的作用。然而，海洋环境中普遍存在的雾、强反射等因素会导致严重的图像退化，显著影响语义感知的稳定性。现有的复原与增强方法通常针对特定退化类型或仅关注视觉质量，缺乏能够同时提升结构恢复与语义有效性的端到端协同机制。此外，公开可用的红外-可见光数据集主要采集自城市场景，未能捕捉海洋环境中耦合退化的真实特征。为应对这些挑战，本文提出了红外-可见光海上船舶数据集（Infrared-Visible Maritime Ship Dataset, IVMSD），涵盖多种天气与光照条件下的海洋场景。基于该数据集，我们提出了一种多任务互补学习框架（Multi-task Complementary Learning Framework, MCLF），在统一架构内协同执行图像复原、多模态融合与语义分割。该框架包含用于退化抑制与结构增强的频率-空间增强互补（Frequency-Spatial Enhancement Complementary, FSEC）模块，用于语义一致性引导的语义-视觉一致性注意力（Semantic-Visual Consistency Attention, SVCA）模块，以及用于选择性融合的跨模态引导注意力机制。在IVMSD上的实验结果表明，所提方法实现了最先进的分割性能，显著提升了复杂海洋条件下的鲁棒性与感知质量。

Abstract

Marine scene understanding and segmentation plays a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.

Keywords: Infrared-Visible Image Fusion, Semantic Segmentation, Maritime Scene Understanding, Image Restoration, Multi-task Learning, Marine Environment, Multimodal Fusion, Degradation Suppression

218. ❌ SVH-BD : Synthetic Vegetation Hyperspectral Benchmark Dataset for Emulation of Remote Sensing Images

Authors: Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

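Scoring rationale: The paper presents SVH-BD, a synthetic hyperspectral vegetation dataset that supports research on radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on), which focus on large language models and deep learning techniques themselves. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the dataset is meant to support machine-learning tasks (such as developing emulators and inversion methods) in remote sensing science and so falls under AI for Science; since this is not the paper's core contribution (which is the dataset itself), it receives 5 points (some relevance).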

!!! tip deepseek-chat TL;DR

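The study builds SVH-BD, a large synthetic hyperspectral vegetation dataset of 10,915 image cubes with pixel-level vegetation trait maps, intended as a benchmark resource for radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification.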

Abstract (Chinese translation)

本数据集提供了10,915组合成高光谱图像立方体与像素级植被性状图的配对集合，旨在支持辐射传输模拟、植被性状反演及不确定性量化的相关研究。每个高光谱立方体包含211个波段，覆盖400–2500 nm光谱范围，分辨率为10 nm，空间布局固定为64×64像素，提供连续的模拟地表反射光谱，适用于模拟器开发及需要高光谱细节的机器学习任务。植被性状通过基于PROSAIL的查找表方法反演Sentinel-2 Level-2A地表反射率数据获得，并进一步通过PROSAIL前向模拟在物理一致的冠层与光照条件下生成高光谱反射率。数据集涵盖四个生态多样性区域——东非、法国北部、印度东部和西班牙南部，同时包含第5与第95百分位的不确定性图及Sentinel-2场景分类层。该资源可用于反演方法的基准测试、快速辐射传输模拟器的开发，以及在受控但真实的环境变化下研究光谱与生物物理特征的关系。

Abstract

This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400–2500 nm at 10 nm resolution and a fixed spatial layout of $64 \times 64$ pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions – East Africa, Northern France, Eastern India, and Southern Spain – and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral–biophysical relationships under controlled yet realistic environmental variability.

Keywords: synthetic hyperspectral dataset, vegetation trait retrieval, radiative transfer emulation, PROSAIL simulation, uncertainty quantification, benchmark dataset, remote sensing, machine learning
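
The cube dimensions quoted in the abstract can be checked with a few lines of arithmetic. The sketch assumes band centres sit on a regular 10 nm grid with both endpoints (400 nm and 2500 nm) included, which is the reading consistent with the stated 211 bands; the function name is hypothetical.

```python
def band_centers(start_nm=400, stop_nm=2500, step_nm=10):
    """Band centres on a regular grid, endpoints included (an assumption)."""
    return list(range(start_nm, stop_nm + 1, step_nm))

bands = band_centers()
values_per_cube = 64 * 64 * len(bands)  # 64 x 64 pixels, one value per band

print(len(bands))        # number of spectral bands
print(values_per_cube)   # reflectance values in one cube
```

With 10,915 cubes, multiplying `values_per_cube` by the cube count gives a quick estimate of the dataset's raw size before choosing a storage dtype.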

219. ❌ AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

Authors: Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28366v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

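Scoring rationale: The paper proposes the AutoCut framework and builds a multimodal large language model for video editing on top of a foundation model through supervised fine-tuning (SFT), so it is highly relevant to "Supervised Fine-tuning" (10 points). It discretizes features with residual vector quantization, which gives some relevance to "Quantization" (5 points). Because the framework is built on a foundation model, it is relevant to "Large Language Models" (8 points). Other keywords such as MoE, Scaling Laws, and RLHF are not mentioned in the abstract and are scored 0.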

!!! tip deepseek-chat TL;DR

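The study addresses the high cost and low efficiency of short-form video ad production. It proposes AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable generation, and reports experiments showing reduced production cost, shorter iteration time, and improved consistency and controllability.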

Abstract (Chinese translation)

短视频已成为数字广告的主要媒介，其内容创作需要具备可扩展性和高效性。然而，当前的工作流程与人工智能工具仍处于割裂状态且局限于单一模态，导致制作成本高昂、整体效率低下。为解决这一问题，我们提出了AutoCut——一个基于多模态离散化与可控编辑的端到端广告视频编辑框架。AutoCut采用专用编码器提取视频与音频特征，随后通过残差向量量化将其离散化为与文本表征对齐的统一标记（tokens），从而构建共享的视频-音频-文本标记空间。基于基础模型，我们进一步通过多模态对齐与监督微调相结合的方式，开发了用于视频编辑的多模态大语言模型，该模型在统一的编辑框架内支持视频筛选与排序、脚本生成及背景音乐选择等任务。最终，完整的生产流水线将预测的标记序列转换为可部署的长视频输出。在真实广告数据集上的实验表明，AutoCut在显著提升一致性与可控性的同时，有效降低了制作成本与迭代时间，为规模化视频创作开辟了新路径。

Abstract

Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.

Keywords: advertisement video editing, multimodal discretization, controllable generation, multimodal large language model, supervised fine-tuning, residual vector quantization, end-to-end framework, video-audio-text token space
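
The abstract above says AutoCut discretizes video and audio features with residual vector quantization. As a rough illustration of that general technique, and not of AutoCut's actual implementation, the sketch below quantizes a vector in stages, with each stage encoding the residual left by the previous one. The function names and the tiny `TOY_CODEBOOKS` are hypothetical; a real system would learn much larger codebooks.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest_index(vec: Sequence[float], codebook: Codebook) -> int:
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def rvq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize vec stage by stage; each stage encodes what the previous stages missed."""
    residual = list(vec)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse stage followed by a fine stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
```

With these toy codebooks, `rvq_encode([1.2, 0.8], TOY_CODEBOOKS)` selects codes `[1, 1]`, which `rvq_decode` maps back to `[1.25, 0.75]`. The list of integer codes is the kind of discrete token sequence that can sit alongside text tokens.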

220. ❌ Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Authors: Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28367v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

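Scoring rationale: The paper focuses on visual autoregressive (VAR) models for text-guided image editing and proposes a method that improves structural consistency. All scoring keywords relate directly to large language models (LLMs), deep learning technical principles, or AI applications in science, whereas this paper studies image editing in computer vision. It is unrelated to text language models and involves neither innovation in deep learning technical principles nor scientific applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper addresses the structural consistency challenge that visual autoregressive models face in text-guided image editing. It proposes an adaptive feature injection method based on feature analysis and reinforcement learning that significantly improves structure preservation in the edited results.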

Abstract (Chinese translation)

视觉自回归（VAR）模型近年来已成为一类前景广阔的生成模型，能够支持文本引导图像编辑等多种下游视觉任务。通过将编辑范式从基于扩散方法的噪声操作转向令牌级操作，基于VAR的方法实现了更优的背景保持能力和显著更快的推理速度。然而，现有的基于VAR的编辑方法仍面临两个关键挑战：如何准确定位可编辑令牌，以及如何在编辑结果中保持结构一致性。本研究提出了一种基于VAR模型中间特征分布分析的新型文本引导图像编辑框架。首先，我们引入一种由粗到精的令牌定位策略，能够细化可编辑区域，在编辑保真度与背景保持之间取得平衡。其次，我们分析了VAR模型的中间表示，识别出与结构相关的特征，并据此设计了一种简单而有效的特征注入机制，以增强编辑图像与源图像之间的结构一致性。第三，我们开发了一种基于强化学习的自适应特征注入方案，能够自动学习特定尺度与层级的注入比例，从而联合优化编辑保真度与结构保持能力。大量实验表明，在局部和全局编辑场景下，我们的方法相比现有先进技术均实现了更优的结构一致性与编辑质量。

Abstract

Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.

Keywords: visual autoregressive models, text-guided image editing, structure preservation, token localization, feature injection, reinforcement learning, editing fidelity, background preservation

221. ❌ SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering

Authors: Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim, Jihie Kim Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28363v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

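Scoring rationale: The paper studies the evaluation of sketch abstraction efficiency. It proposes the SEA metric and the CommonSketch dataset and uses a visual question answering model to assess which sketch elements are retained. All keywords concern the technical principles of large models and deep learning or specific application areas (such as bioinformatics), whereas this paper focuses on evaluating sketch understanding in computer vision. It involves neither innovation in large-model techniques nor AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SEA metric for evaluating sketch abstraction efficiency and builds the CommonSketch dataset, using a visual question answering model to quantify how well a sketch preserves semantic recognizability while simplifying its visual elements.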

Abstract (Chinese translation)

草图是一种经过提炼的视觉抽象形式，它通过简化而有意图的笔触传达核心概念，同时省略无关细节。尽管其表现力强大，量化草图中语义抽象的效率仍然具有挑战性。现有的评估方法依赖于参考图像、低级视觉特征或识别准确度，未能捕捉到抽象性这一草图的本质属性。为应对这些局限，我们提出了SEA（面向抽象效率的草图评估指标），这是一种无需参考的指标，用于评估草图如何精炼地表现类别定义的视觉元素，同时保持语义可识别性。这些元素是基于关于草图中通常描绘特征的常识知识，按类别提取的。SEA利用视觉问答模型来确定每个元素的存在与否，并返回一个量化分数，该分数反映了在视觉经济性下的语义保留程度。为支持此指标，我们提出了CommonSketch，这是首个带有语义标注的草图数据集，包含跨越300个类别的23,100幅手绘草图，每幅草图均配有标题和元素级标注。实验表明，SEA与人类判断高度一致，并能可靠地区分抽象效率的层次，而CommonSketch作为一个基准数据集，为评估各种视觉-语言模型在元素级草图理解方面的能力提供了系统化的评估框架。

Abstract

A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.

Keywords: sketch abstraction, visual question answering, evaluation metric, semantic recognition, element-level annotation, benchmark dataset, computer vision, vision-language models

222. ❌ Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images

Author: Ha Anh Vu Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

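Scoring rationale: The paper focuses on brain tumor MRI image classification with a weighted ensemble learning approach that combines deep learning models (such as ResNet and DenseNet) with traditional machine learning models (such as SVM and KNN), and it belongs to medical image analysis. It is unrelated to the keywords on large-model techniques, training methods, inference optimization, and agent systems, so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" is scored 0. The "AI for Science" keyword receives a relevance of 5 because the paper applies AI to medical image analysis (related to bioinformatics), although this is only the application setting rather than the core innovation.

!!! tip deepseek-chat TL;DR

The paper proposes a weighted ensemble learning approach that combines deep learning and traditional machine learning models for brain tumor classification from MRI images, reporting state-of-the-art accuracy on the Figshare and Kaggle datasets.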

Abstract (Chinese translation)

脑肿瘤在磁共振成像扫描中的精确分类对于有效诊断和治疗规划至关重要。本文提出一种加权集成学习方法，通过结合深度学习与传统机器学习模型以提升分类性能。该系统整合了多种分类器，包括ResNet101、DenseNet121、Xception、CNN-MRI、基于边缘增强图像的ResNet50，以及采用方向梯度直方图特征的SVM和K近邻算法。通过加权投票机制，为具有更高个体准确率的模型分配更大权重，从而确保决策的鲁棒性。研究采用平衡对比度增强、K均值聚类和Canny边缘检测等图像处理技术以优化特征提取。在Figshare和Kaggle磁共振成像数据集上的实验评估表明，所提方法达到了先进的准确率，性能优于现有模型。这些发现凸显了基于集成的学习在改善脑肿瘤分类方面的潜力，为医学图像分析提供了一个可靠且可扩展的框架。

Abstract

The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.

Keywords: brain tumor classification, MRI images, ensemble learning, weighted voting, deep learning, medical image analysis, ResNet, SVM
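
The abstract above describes a weighted voting mechanism in which models with better individual accuracy get more influence. The sketch below shows one common way to realize that idea, assuming weights proportional to validation accuracy; the paper's exact weighting rule is not given in the abstract. The accuracy values and class labels are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Mapping


def accuracy_weights(val_accuracy: Mapping[str, float]) -> Dict[str, float]:
    """Normalize per-model validation accuracies so the weights sum to 1."""
    total = sum(val_accuracy.values())
    return {name: acc / total for name, acc in val_accuracy.items()}


def weighted_vote(predictions: Mapping[str, str], weights: Mapping[str, float]) -> str:
    """Return the label with the largest total weight across all models."""
    tally: Dict[str, float] = defaultdict(float)
    for model, label in predictions.items():
        tally[label] += weights[model]
    return max(tally, key=lambda label: tally[label])


# Hypothetical validation accuracies (not taken from the paper).
weights = accuracy_weights(
    {"ResNet101": 0.95, "DenseNet121": 0.90, "SVM+HOG": 0.80, "KNN+HOG": 0.70}
)

# Two models vote for each class, so a plain majority vote would tie.
# The weights break the tie in favor of the more accurate pair.
decision = weighted_vote(
    {
        "ResNet101": "glioma",
        "DenseNet121": "glioma",
        "SVM+HOG": "meningioma",
        "KNN+HOG": "meningioma",
    },
    weights,
)
```

Here `decision` is `"glioma"`, because the two deep models carry more combined weight than the two HOG-based classifiers.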

223. ❌ VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Authors: Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28353v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

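Scoring rationale: The VistaGEN paper focuses on driving video generation, using multiview visual-language reasoning to achieve fine-grained control and spatiotemporal consistency. Although it involves vision-language models, all keywords target specific aspects of large language models (LLMs), such as technical principles, training methods, inference optimization, alignment, and applications. The core of this paper is instead the combination of a video generation model (such as a diffusion model) with visual-language features, and it does not address LLM architecture, training, inference, or scientific applications. All keywords are therefore unrelated to the paper and scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes VistaGEN, a driving video generation technique that uses multiview visual-language reasoning to achieve fine-grained control and spatiotemporal consistency. It addresses object-level controllability and consistency in long video generation and improves output quality through a generation-evaluation-regeneration closed loop.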

Abstract (Chinese translation)

驾驶视频生成技术在可控性、视频分辨率与时长方面已取得显著进展，但在保持时空一致性（尤其是生成长视频时）的同时，仍难以实现对多样化驾驶视频的细粒度物体级控制。本文提出一种新的驾驶视频生成技术VistaGEN，它能够在长视频序列中保持时空一致性的同时，实现对特定实体（包括3D物体、图像及文本描述）的细粒度控制。我们的核心创新在于将多视角视觉-语言推理融入长驾驶视频生成过程。为此，我们将视觉-语言特征注入多视角视频生成器，以实现细粒度可控性。更重要的是，我们提出一种多视角视觉-语言评估器（MV-VLM），能够智能且自动地评估生成内容的时空一致性，从而构建出一种新颖的“生成-评估-再生成”闭环生成机制。该机制确保了高质量、连贯的输出，助力构建复杂可靠的驾驶场景。此外，在闭环生成框架内，我们引入了物体级优化模块，用于精炼MV-VLM评估中未达标的生成结果，并将其反馈至视频生成器进行重新生成。大量实验评估表明，我们的VistaGEN能够实现具有细粒度可控性的多样化驾驶视频生成结果（尤其针对长尾物体），且在时空一致性方面显著优于现有方法。

Abstract

Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

Keywords: driving video generation, fine-grained control, spatiotemporal consistency, multiview visual-language reasoning, closed-loop generation, object-level refinement, long video sequences, visual-language features
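
The generation-evaluation-regeneration loop described in the abstract above can be pictured as a simple control loop. The sketch below is only a schematic under stated assumptions. The `generate`, `evaluate`, and `refine` callables are hypothetical stand-ins for the video generator, the MV-VLM evaluator, and the object-level refinement module, and the threshold and round limit are illustrative parameters that the abstract does not specify.

```python
from typing import Callable, Tuple, TypeVar

Condition = TypeVar("Condition")
Output = TypeVar("Output")


def closed_loop_generate(
    generate: Callable[[Condition], Output],
    evaluate: Callable[[Output], float],
    refine: Callable[[Output, Condition], Condition],
    condition: Condition,
    threshold: float = 0.7,
    max_rounds: int = 5,
) -> Tuple[Output, float, int]:
    """Generate, score, and regenerate until the score passes or rounds run out.

    Returns the best output seen, its score, and the number of rounds used.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    best_output, best_score = None, float("-inf")
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        output = generate(condition)
        score = evaluate(output)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break  # the evaluator accepts this result
        # Feed the rejected result back so the next round can correct it.
        condition = refine(output, condition)
    return best_output, best_score, rounds_used


# Toy stand-ins: the "video" is a dict whose quality grows with a detail level.
toy_result = closed_loop_generate(
    generate=lambda detail: {"detail": detail},
    evaluate=lambda video: video["detail"] / 4,
    refine=lambda video, detail: detail + 1,
    condition=1,
)
```

In this toy run the scores are 0.25, 0.5, and 0.75, so the loop stops after three rounds and `toy_result` is `({"detail": 3}, 0.75, 3)`. Keeping the best output seen is a design choice of this sketch: it gives a usable result even if no round reaches the threshold.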

224. ❌ SFDemorpher: Generalizable Face Demorphing for Operational Morphing Attack Detection

Authors: Raul Ismayilov, Luuk Spreeuwers Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

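Scoring rationale: The paper focuses on computer vision and biometric security, studying face demorphing for differential morphing attack detection (D-MAD) with generative models such as StyleGAN. All scoring keywords concern large language models (LLMs) and related techniques (training, alignment, inference, deployment optimization, and so on) or AI for Science applications. This paper involves no LLMs, no innovation in deep learning technical principles, and no scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes the SFDemorpher framework, which performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. It addresses the lack of operational generalizability in existing face demorphing methods for morphing attack detection, which stems from limited training data and the assumption that all document inputs are morphs. It reports state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques.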

Abstract (Chinese translation)

面部融合攻击通过生成可验证多个身份的文件图像，破坏了生物识别安全性，从证件签发到边境管控均构成重大风险。差分融合攻击检测（D-MAD）提供了一种有效的应对策略，尤其在采用面部解融合技术以分离融合图像中的混合身份时效果显著。然而，现有方法因训练数据有限且假设所有输入文件均为融合图像，缺乏实际部署的泛化能力。本文提出SFDemorpher框架，专为面向D-MAD的面部解融合操作部署设计，该框架在联合StyleGAN潜在空间与高维特征空间内执行身份解耦。我们引入一种双通道训练策略，可同时处理融合文件与真实文件，并利用以合成身份为主的混合数据集增强对未知数据分布的鲁棒性。大量实验验证表明，该框架在边境验证及具有挑战性的证件注册阶段，对未知身份、多样采集条件及13种融合技术均展现出先进的泛化性能。我们的框架通过扩大真实样本与融合样本的分数分布差异，同时提供高保真视觉重建以增强可解释性，实现了卓越的D-MAD性能。

Abstract

Face morphing attacks compromise biometric security by creating document images that verify against multiple identities, posing significant risks from document issuance to border control. Differential Morphing Attack Detection (D-MAD) offers an effective countermeasure, particularly when employing face demorphing to disentangle identities blended in the morph. However, existing methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs. This paper presents SFDemorpher, a framework designed for the operational deployment of face demorphing for D-MAD that performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. We introduce a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions. Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques, spanning both border verification and the challenging document enrollment stage. Our framework achieves superior D-MAD performance by widening the margin between the score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions facilitating explainability.

Keywords: Face Demorphing, Morphing Attack Detection, D-MAD, StyleGAN, Identity Disentanglement, Generalizability, Biometric Security, Operational Deployment

225. ❌ Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

Authors: Luke Palmer, Petar Palasek, Hazem Abdelkawy Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28319v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

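Scoring rationale: The paper focuses on human gaze modelling in computer vision, specifically simulating dynamic attention in driving scenes. It covers a heterogeneous graph transformer (the Affinity Relation Transformer), an autoregressive dynamical system, raw gaze trajectory modelling, and dataset construction (Focus100). All given keywords relate directly to large language models, deep learning technical principles, or AI applications in science, whereas the core of this paper is a specific computer vision task (gaze simulation) that involves no large-model techniques, deep learning innovation, or AI applications in scientific fields such as biology or chemistry. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a graph-based autoregressive dynamical system that simulates human gaze trajectories in driving scenes. Using the Affinity Relation Transformer and the Object Density Network, trained directly on raw gaze data, it produces more natural gaze trajectories and saliency maps than existing attention models.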

Abstract (Chinese translation)

精确建模人类注意力对众多计算机视觉应用至关重要，尤其在汽车安全领域。现有方法通常将注视数据简化为显著图或扫描路径，仅隐式处理注视动态。本研究将注视建模构建为自回归动态系统，基于注视历史与动态环境，显式地随时间展开原始注视轨迹。驾驶场景被表示为以注视为中心的图结构，通过亲和关系变换器（Affinity Relation Transformer, ART）进行处理——这是一种异构图变换器，能够建模驾驶员注视、交通对象与道路结构之间的交互。我们进一步引入对象密度网络（Object Density Network, ODN）来预测下一步注视分布，以捕捉复杂环境中注意力转移的随机性及以对象为中心的特性。同时，我们发布了Focus100数据集，该数据集包含30名参与者观看第一人称驾驶录像时的原始注视数据。我们的统一方法直接在原始注视数据上进行训练（无需注视点过滤），相比现有注意力模型，能够生成更自然的注视轨迹、扫描路径动态及显著图，为动态环境中人类注意力的时序建模提供了重要见解。

Abstract

Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.

Keywords: gaze modeling, dynamic scenes, autoregressive dynamical system, graph transformer, attention modeling, driving scenes, raw gaze trajectories, human attention

226. ❌ Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification

Authors: Yangmei Chen, Zhongyuan Zhang, Xikun Zhang, Xinyu Hao, Mingliang Hou, Renqiang Luo, Ziqi Xu Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28315v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

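Scoring rationale: The paper focuses on a deep learning application to thyroid nodule ultrasound classification and belongs to medical image analysis. The proposed PEMV-thyroid framework involves multi-view learning, prototype enhancement, and handling of heterogeneous data, but it does not touch on keywords such as large language models (LLMs), model architecture innovations (e.g., MoE, quantization), training techniques (e.g., RLHF, PEFT), inference optimization, or agents. It is only loosely related to "AI for Science OR Bioinformatics OR Cheminformatics", as an application of AI to biomedicine (related to bioinformatics) that is not the core content, so that keyword receives 5 points. All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

To address the limited generalisation caused by data heterogeneity in thyroid nodule ultrasound image classification, the paper proposes a prototype-enhanced multi-view learning framework (PEMV-thyroid) that improves diagnostic accuracy and generalisation in cross-device and cross-domain evaluations.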

Abstract (Chinese translation)

甲状腺结节超声影像分类对早期诊断与临床决策至关重要；然而，尽管现有深度学习方法在分布内数据上表现出良好性能，但在不同超声设备或临床环境中部署时，其鲁棒性与泛化能力往往受限。这一局限主要源于甲状腺超声图像显著的异质性，可能导致模型学习到虚假关联而非可靠的诊断特征。为应对这一挑战，我们提出PEMV-thyroid（Prototype-Enhanced Multi-View Learning）框架，该框架通过从多特征视角学习互补表征，并利用混合原型信息的原型校正机制优化决策边界，从而有效处理数据异质性问题。通过将多视角表征与原型级指导相结合，所提方法能够在异质性成像条件下实现更稳定的表征学习。在多个甲状腺超声数据集上的大量实验表明，PEMV-thyroid始终优于现有先进方法，尤其在跨设备与跨域评估场景中表现突出，显著提升了真实临床环境中的诊断准确性与泛化性能。源代码已公开于https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning。

Abstract

Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.

Keywords: Thyroid nodule classification, Ultrasound imaging, Multi-view learning, Prototype enhancement, Data heterogeneity, Cross-domain generalization, Deep learning, Medical image analysis

227. ❌ DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis

Authors: Kun Tang, Xinquan Yang, Mianjie Zheng, Xuefen Liu, Xuguang Li, Xiaoqi Guo, Ruihan Chen, Linlin Shen, He Meng Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28297v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

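Scoring rationale: The paper studies the application of the computer-vision model DINOv3 to dental image analysis, which falls under AI for Science (biomedical AI), so it is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It covers domain adaptation and fine-tuning strategies for a pre-trained model (DINOv3), including the parameter-efficient fine-tuning method LoRA, so it has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points) and 'Post-training OR Supervised Fine-tuning OR SFT' (8 points). The paper explicitly names and evaluates LoRA, making it highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points). The remaining keywords mainly concern LLM-specific techniques (such as MoE, scaling laws, RLHF, RAG, and inference acceleration); this paper focuses on applying a vision foundation model (DINOv3) to a specific scientific domain and does not involve LLMs or those techniques, so those keywords score 0.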
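
The 26.6 threshold in this entry's score line follows from the pass formula in the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The sketch below reproduces that arithmetic and a pass check. The function names are illustrative, and the rule that a paper passes when its total meets or exceeds the threshold is an assumption, since the report does not state it. How the per-keyword score column is derived from weight and relevance is also not stated, so it is left out.

```python
def passing_score(total_weight, base=5.0, slope=0.8):
    """Pass threshold from the report's configured formula: base + slope * total weight."""
    return round(base + slope * total_weight, 1)


def passes(total_score, total_weight):
    # Assumption: a paper passes when its total score reaches the threshold.
    return total_score >= passing_score(total_weight)


threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0
print(threshold, passes(0.0, 27.0))
```

With a total score of 0.0, this entry falls below the threshold, which matches the ❌ marker.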

!!! tip deepseek-chat TL;DR

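The paper examines the reliability and performance of the self-supervised vision foundation model DINOv3 as a unified encoder for dental image analysis (panoramic radiographs and intraoral photographs). By building the DinoDental benchmark and comparing adaptation strategies (such as LoRA), it finds that DINOv3 can serve as a strong dental image encoder without domain-specific pre-training, with particular strength in intraoral image understanding and boundary-sensitive dense prediction.

Abstract translation (Chinese)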

牙科影像中专家标注的稀缺性与高昂成本，是人工智能在牙科领域发展的重大挑战。DINOv3作为一种先进的、基于17亿张图像预训练的自监督视觉基础模型，为缓解这一问题提供了可行路径。然而，该模型在迁移至具有独特成像特征与临床细微差异的牙科领域时，其可靠性尚不明确。为此，我们提出了DinoDental——一个统一的基准测试框架，旨在系统评估DINOv3能否在不进行领域特定预训练的情况下，作为可靠的开箱即用编码器，用于全面的牙科影像分析。DinoDental整合了多个公共数据集，涵盖全景X线片和口内照片的分类、检测与实例分割等多种任务。我们进一步通过缩放模型尺寸与输入分辨率，并比较不同适应策略（包括冻结特征、全微调以及参数高效的LoRA方法），分析了模型的迁移性能。实验表明，DINOv3可作为适用于全景X线片与口内照片的强效统一编码器，在各任务中保持竞争力，尤其在对口内图像理解及边界敏感的密集预测任务中展现出明显优势。总体而言，DinoDental为系统评估DINOv3在牙科分析中的性能提供了框架，建立了基础性基准，以指导牙科人工智能领域高效、有效的模型选择与适应。

Abstract

The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model’s transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
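
The abstract compares frozen features, full fine-tuning, and LoRA. As a rough illustration of why LoRA is parameter-efficient, the sketch below adds a low-rank update to one frozen linear layer. It uses plain Python lists and made-up dimensions; the names `lora_forward`, `alpha`, and the layer sizes are assumptions for illustration and are not taken from the DinoDental code.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_forward(x, w_frozen, a, b, alpha):
    """Compute x @ W + (alpha / r) * (x @ A) @ B; only A and B would be trained."""
    r = len(b)  # rank = number of rows in B
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[y + scale * d for y, d in zip(y_row, d_row)]
            for y_row, d_row in zip(base, delta)]


random.seed(0)
d_in, d_out, rank = 8, 6, 2  # hypothetical layer sizes
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]    # frozen weights
a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # trainable, small random init
b = [[0.0] * d_out for _ in range(rank)]                                 # trainable, zero init
x = [[random.gauss(0, 1) for _ in range(d_in)]]

full_params = d_in * d_out           # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters updated by LoRA
print(full_params, lora_params)
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the low-rank correction.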

Keywords: DINOv3, dental image analysis, vision foundation model, parameter-efficient fine-tuning, LoRA, domain adaptation, benchmark, self-supervised learning

228. ❌ TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K

Authors: Mattia D’Urso, Yuxi Hu, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28287v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

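Scoring rationale: The paper 'TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K' focuses on creating a 3D reconstruction dataset, covering high-resolution image capture, camera calibration, and depth map generation; it belongs to computer vision and 3D modeling. All scoring keywords relate directly to large models, the principles of deep learning techniques, or applications of AI in science, whereas this paper involves no large models, deep learning techniques, or AI for Science content, so every keyword has a relevance of 0.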

!!! tip deepseek-chat TL;DR

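To address the shortcomings of existing 3D reconstruction datasets in resolution, scale, and scene diversity, the paper creates TerraSky3D, a high-resolution, large-scale 3D reconstruction dataset of 50,000 images across 150 ground, aerial, and mixed scenes. It focuses on European landmarks and provides calibration data, camera poses, and depth maps.

Abstract translation (Chinese)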

尽管日益复杂的三维重建流程对数据的需求不断增长，我们仍可观察到缺乏合适的公开数据集。现有的三维数据集要么分辨率较低，要么局限于少量场景，要么因从互联网收集而基于质量参差的图像，要么受限于特定的采集场景。
受此合适三维数据集匮乏的驱动，我们采集了TerraSky3D——一个高分辨率、大规模的三维重建数据集，包含5万张图像，划分为150个地面、航空及混合场景。该数据集聚焦于欧洲地标，并提供精心整理的校准数据、相机位姿及深度图。TerraSky3D旨在回应业界对具有挑战性数据集的需求，以用于训练和评估与三维重建相关的流程。

Abstract

Despite the growing need for data of more and more sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small amount of scenes, based on images of varying quality because retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D tries to answer the need for challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.

Keywords: 3D reconstruction, high-resolution dataset, multi-view images, European landmarks, camera calibration, depth maps, large-scale dataset, aerial and ground scenes

229. ❌ Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal

Authors: Kazuma Ikeda, Ryosei Hara, Rokuto Nagata, Ozora Sako, Zihao Ding, Takahiro Kado, Ibuki Fujioka, Taro Beppu, Mariko Isogawa, Kentaro Yoshioka Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

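Scoring rationale: The paper focuses on detecting and removing ghost points in LiDAR data, which belongs to computer vision and sensor data processing. Its core contributions are a large-scale full-waveform LiDAR dataset (Ghost-FWL), a baseline model built on that data, and a self-supervised learning method (FWL-MAE). All keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or applications of AI in science. Although the paper uses AI techniques (such as self-supervised learning and a baseline model), its subject is LiDAR data processing rather than innovation in large models or deep learning principles, and it does not fall under AI for Science subfields such as bioinformatics or cheminformatics. Therefore 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance) because the work involves AI in scientific and engineering applications (such as autonomous driving and robotics), and all other keywords are rated 0 (completely unrelated).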

!!! tip deepseek-chat TL;DR

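To address ghost points caused by glass and reflective surfaces in mobile LiDAR, the paper presents Ghost-FWL, the first large-scale full-waveform LiDAR dataset for this task, together with a baseline model built on it. The approach improves ghost removal accuracy and substantially improves downstream tasks such as SLAM and 3D object detection.

Abstract translation (Chinese)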

激光雷达已成为自动驾驶、机器人技术和智慧城市应用中的关键感知模态。然而，由玻璃及反射表面引发的多路径激光回波所产生的虚假反射点（即鬼影点），严重降低了三维建图与定位的精度。现有的鬼影去除方法依赖于密集点云中的几何一致性，难以适用于移动激光雷达所采集的稀疏动态数据。为此，我们利用全波形激光雷达技术来解决这一问题：该技术不仅记录峰值距离，更捕获完整的时间强度剖面，从而为移动场景中区分真实反射与鬼影提供了关键线索。由于这是一项新任务，我们提出了Ghost-FWL数据集——这是首个且规模最大的、用于鬼影检测与去除的移动全波形激光雷达标注数据集。Ghost-FWL涵盖10个多样化场景中的2.4万帧数据，包含75亿个峰值级标注，其规模是现有标注全波形数据集的100倍。受益于此大规模数据集，我们建立了一个基于全波形数据的鬼影检测基线模型，并提出了FWL-MAE——一种掩码自编码器，用于在全波形数据上进行高效的自监督表征学习。实验表明，我们的基线模型在鬼影去除准确率上优于现有方法，且鬼影去除进一步提升了下游任务的性能，例如基于激光雷达的同步定位与建图（轨迹误差降低66%）和三维目标检测（误报率降低50倍）。数据集与代码已公开，可通过项目页面访问：https://keio-csg.github.io/Ghost-FWL

Abstract

LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR’s sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish a FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code is publicly available and can be accessed via the project page: https://keio-csg.github.io/Ghost-FWL
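
The abstract contrasts full-waveform LiDAR, which records a whole intensity profile over time, with sensors that report only peak distances, and it speaks of peak-level annotations. The toy sketch below shows what a "peak" in such a profile could look like and how a sample index maps to range through the round-trip relation d = c·t/2. The threshold, the 1 ns sample period, and the waveform values are made up for illustration; this is not the Ghost-FWL baseline or its annotation procedure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def find_peaks(waveform, threshold):
    """Return indices of local maxima whose intensity exceeds the threshold."""
    peaks = []
    for i in range(1, len(waveform) - 1):
        if (waveform[i] > threshold
                and waveform[i] > waveform[i - 1]
                and waveform[i] >= waveform[i + 1]):
            peaks.append(i)
    return peaks


def bin_to_range_m(index, sample_period_s):
    """Convert a sample index to one-way distance: round-trip time * c / 2."""
    return index * sample_period_s * SPEED_OF_LIGHT_M_PER_S / 2.0


# One toy return with two echoes, e.g. a glass pane and an object behind it.
waveform = [0, 1, 5, 1, 0, 0, 2, 3, 2, 0]
peaks = find_peaks(waveform, threshold=1.5)
ranges = [bin_to_range_m(i, sample_period_s=1e-9) for i in peaks]
print(peaks, ranges)
```

A peak-only sensor would keep just these indices, whereas a full-waveform sensor also keeps the shape around each peak, which is the extra cue the abstract refers to.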

Keywords: LiDAR, ghost detection, full-waveform LiDAR, dataset, self-supervised learning, SLAM, 3D object detection, autonomous driving

230. ❌ Explaining CLIP Zero-shot Predictions Through Concepts

Authors: Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

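Scoring rationale: The paper studies the interpretability of zero-shot predictions from CLIP (a vision-language model), using a concept-bottleneck approach to provide transparent explanations. It is unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, agent systems, and so on) and is highly relevant only to 'Mechanistic Interpretability OR Explainable AI', because its core contribution is improving model interpretability.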

!!! tip deepseek-chat TL;DR

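The paper addresses the opacity of predictions made by large-scale vision-language models such as CLIP in zero-shot image recognition. It proposes EZPC, which projects CLIP's embeddings into a concept space and provides interpretable concept-level explanations while preserving classification accuracy.

Abstract translation (Chinese)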

诸如CLIP的大规模视觉语言模型在零样本图像识别方面取得了显著成功，但其预测结果对人类理解而言仍基本不透明。相比之下，概念瓶颈模型通过基于人工定义概念的推理提供可解释的中间表征，但它们依赖概念监督且缺乏泛化至未见类别的能力。我们提出的EZPC方法通过人类可理解的概念解释CLIP的零样本预测，从而桥接这两种范式。该方法将CLIP的联合图文嵌入映射至从语言描述中学习到的概念空间，无需额外监督即可生成忠实且透明的解释。模型通过结合对齐与重构目标学习此映射，确保概念激活在保持CLIP语义结构的同时仍具可解释性。在CIFAR-100、CUB-200-2011、Places365、ImageNet-100和ImageNet-1k五个基准数据集上的大量实验表明，我们的方法在维持CLIP强大零样本分类准确性的同时，能提供有意义的概念级解释。通过将开放词汇预测锚定于显式语义概念，本方法为构建可解释且可信赖的视觉语言模型迈出了理论性的一步。代码发布于https://github.com/oonat/ezpc。

Abstract

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. Our method projects CLIP’s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP’s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP’s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
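
To make the idea of explaining a zero-shot prediction through concepts more concrete, the sketch below projects an image embedding and each class text embedding into a small concept space with one linear map, scores classes by the dot product in that space, and reports how much each concept contributed. This is one plausible reading of the abstract rather than the EZPC implementation: the projection here is hand-written instead of learned, and the embeddings, concept names, and function names are made up.

```python
def project(embedding, concept_matrix):
    """Map an embedding to concept activations with a linear projection."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in concept_matrix]


def explain_prediction(image_emb, class_text_embs, concept_matrix, concept_names):
    """Score each class in concept space and return per-concept contributions."""
    image_concepts = project(image_emb, concept_matrix)
    scores, contributions = {}, {}
    for label, text_emb in class_text_embs.items():
        text_concepts = project(text_emb, concept_matrix)
        per_concept = [i * t for i, t in zip(image_concepts, text_concepts)]
        scores[label] = sum(per_concept)
        contributions[label] = dict(zip(concept_names, per_concept))
    best = max(scores, key=scores.get)
    return best, contributions[best]


# Toy 3-d embeddings and a hand-written 2-concept projection (all values made up).
concept_names = ["has wings", "has fur"]
concept_matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.3]
class_text_embs = {"bird": [1.0, 0.0, 0.2], "cat": [0.0, 1.0, 0.2]}

label, why = explain_prediction(image_emb, class_text_embs, concept_matrix, concept_names)
top_concept = max(why, key=why.get)
print(label, top_concept)
```

The class score decomposes into a sum over concepts, so the largest term names the concept that drove the prediction.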

Keywords: CLIP, zero-shot prediction, interpretability, concept bottleneck, vision-language models, explainable AI, concept space, transparent explanations

231. ❌ A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

Authors: Xuanlong Yu, Youyang Sha, Longfei Liu, Xi Shen, Di Yang Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

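Scoring rationale: The paper focuses on few-shot object detection (FSOD) in computer vision and proposes a hybrid ensemble decoder and a progressive fine-tuning framework. Its core innovations are the fine-tuning strategy and the model architecture, which are unrelated to most keywords (especially those on large-model techniques). It is highly relevant only to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), because fine-tuning is the paper's core method, and has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the paper uses pretrained weights and involves cross-domain adaptation. No other keyword applies, so the rest score 0.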

!!! tip deepseek-chat TL;DR

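To address unstable optimization and limited generalization in few-shot object detection, the paper proposes a hybrid ensemble decoder and a progressive fine-tuning framework, which markedly improve performance and robustness on several cross-domain datasets.

Abstract translation (Chinese)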

少样本目标检测（FSOD）因训练样本稀缺导致的优化不稳定和泛化能力受限而面临挑战。为解决这些问题，我们提出了一种混合集成解码器，以在微调阶段增强泛化能力。受集成学习启发，该解码器由一个共享的层次化层和多个并行解码分支组成，其中每个分支采用从共享层继承或新初始化的去噪查询，以促进预测多样性。这一设计充分利用了预训练权重而不引入额外参数，所产生的多样化预测可通过有效集成来提升泛化性能。我们进一步采用统一的渐进式微调框架，结合平台感知的学习率调度策略，该策略能稳定优化过程，并在无需复杂数据增强或大量超参数调优的情况下实现强大的少样本适应能力。在CD-FSOD、ODinW-13和RF100-VL上的大量实验验证了本方法的有效性。值得注意的是，在涵盖多领域100个数据集的RF100-VL基准上，我们的方法在10样本设置下取得了平均41.9的性能表现，显著优于近期方法SAM3的35.7。我们进一步基于CD-FSOD构建了混合领域测试集以评估对分布外（OOD）样本的鲁棒性，结果表明所提出的模块带来了明显的性能提升。这些结果凸显了本方法在有效性、泛化性和鲁棒性方面的优势。代码发布于：https://github.com/Intellindust-AI-Lab/FT-FSOD。

Abstract

Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.
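
The abstract mentions a plateau-aware learning rate schedule without giving details. As a point of reference, the sketch below implements a generic reduce-on-plateau rule: when the monitored loss has not improved for more than `patience` consecutive steps, the learning rate is multiplied by `factor`. The class name and all hyperparameter values are assumptions for illustration, and the paper's actual schedule may differ.

```python
class PlateauLR:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_delta=1e-4, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        """Record one validation loss and return the learning rate to use next."""
        if loss < self.best - self.min_delta:
            # Meaningful improvement: remember it and reset the plateau counter.
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                # Plateau detected: decay the rate, but never below min_lr.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_steps = 0
        return self.lr


scheduler = PlateauLR(lr=0.1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.5]]
print(history)
```

In this run the loss stalls at 0.9 for three steps, so the rate is halved once and then held when the loss improves again.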

Keywords: few-shot object detection, fine-tuning, cross-domain, ensemble decoder, progressive fine-tuning, generalization, robustness, out-of-distribution

232. ❌ ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining

Authors: Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

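Scoring rationale: The paper focuses on pretraining for 3D scene graphs; its core contribution is the Topological Layout Learning framework, which belongs to computer vision and 3D scene understanding. It is unrelated to most large-model keywords and is highly relevant only to 'Pre-training' (10 points), because the paper's core is the design of a pretraining framework. It has some relevance to 'AI for Science' (5 points), because 3D scene understanding can be viewed as an application of AI to the scientific problem of spatial understanding. None of the other keywords, which concern large models, the principles of deep learning techniques, or specific application methods, applies to this paper.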

!!! tip deepseek-chat TL;DR

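To address data scarcity in 3D scene graph generation and the tendency of existing methods to depend on annotations or neglect predicate relations, the paper proposes ToLL, a pretraining framework based on topological layout learning and structural multi-view augmentation. Experiments show that it improves representation quality and outperforms existing baselines.

Abstract translation (Chinese)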

三维场景图（3DSG）生成在空间理解与语义可供性感知中起着关键作用。然而，其泛化能力常受限于数据稀缺性。现有方法主要集中于跨模态辅助表征学习以及以物体为中心的生成式预训练。前者高度依赖谓词标注，而后者则可能因强烈的物体先验而绕过谓词学习。因此，它们往往无法为三维场景图微调提供一种无标注且鲁棒的自监督代理任务。为弥补这一空白，我们提出了一种用于三维场景图预训练的拓扑布局学习框架。具体而言，我们设计了锚点条件化拓扑几何推理，通过图神经网络利用稀疏锚点的空间先验恢复零中心子图的全局布局。该过程严格受谓词特征调制，从而强化了谓词关系学习。此外，我们构建了结构多视图增强以避免语义失真，并通过自蒸馏提升表征质量。在3DSSG数据集上的大量实验表明，我们的拓扑布局学习方法能够有效提升表征质量，性能优于当前最先进的基线模型。

Abstract

3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter’s predicate learning may be bypassed due to strong object priors. Consequently, they could not often provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose a Topological Layout Learning (ToLL) for 3DSG pretraining framework. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning, with a GNN to recover the global layout of zero-centered subgraphs by the spatial priors from sparse anchors. This process is strictly modulated by predicate features, thereby enforcing the predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption, and enhancing representations via self-distillation. The extensive experiments on 3DSSG dataset demonstrate that our ToLL could improve representation quality, outperforming state-of-the-art baselines.

Keywords: 3D Scene Graph, Pretraining, Topological Layout Learning, Self-supervised Learning, Graph Neural Network, Spatial Understanding, Multi-view Augmentation, Anchor-Conditioned Reasoning

233. ❌ Event-Based Method for High-Speed 3D Deformation Measurement under Extreme Illumination Conditions

Authors: Banglei Guan, Yifei Bian, Zibin Liu, Haoyang Li, Xuanyu Bai, Taihang Lei, Bin Li, Yang Shang, Qifeng Yu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

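Scoring rationale: The paper studies a high-precision 3D deformation measurement method based on event cameras, which belongs to computer vision and engineering measurement. Its content has nothing to do with large language models, the principles of deep learning techniques, AI for Science, or any other keyword, so every keyword scores 0.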

!!! tip deepseek-chat TL;DR

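The paper proposes a method based on a multi-event camera array for high-speed 3D deformation measurement of large engineering structures under extreme illumination; experiments show a relative measurement error below 0.08%.

Abstract translation (Chinese)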

背景：航天发射塔、悬索桥等大型工程结构承受极端载荷，易产生高速三维形变并危及安全。此类结构通常处于极端光照环境下工作。传统相机因动态范围有限，在强光照条件下易出现过曝问题，难以准确捕捉图像。
目的：事件相机凭借其高动态范围和低延迟特性，已成为极端光照环境下替代传统相机的有效方案。本文提出一套从标定到测量的完整方法，利用多事件相机阵列实现极端光照条件下结构高速三维形变监测。
方法：首先，本方法结合异步事件流特性与时间相关性分析，提取对应标记点中心位置。随后，通过求解克鲁帕方程并结合参数优化框架实现快速标定。最后，采用统一坐标变换与线性交会法实现目标结构三维形变测量。
结果：实验证实本方法相对测量误差低于0.08%。在极端光照条件下开展的多事件相机阵列自标定与三维形变测量现场实验，验证了所提方法的性能。
结论：本文解决了传统相机在极端光照条件下测量高速三维形变的关键局限。实验结果表明，相较于其他方法，所提方法能够在严苛光照条件下精确测量结构三维形变，且形变测量相对误差小于0.1%。

Abstract

Background: Large engineering structures, such as space launch towers and suspension bridges, are subjected to extreme forces that cause high-speed 3D deformation and compromise safety. These structures typically operate under extreme illumination conditions. Traditional cameras often struggle to handle strong light intensity, leading to overexposure due to their limited dynamic range.
Objective: Event cameras have emerged as a compelling alternative to traditional cameras in high dynamic range and low-latency applications. This paper presents an integrated method, from calibration to measurement, using a multi-event camera array for high-speed 3D deformation monitoring of structures in extreme illumination conditions.
Methods: Firstly, the proposed method combines the characteristics of the asynchronous event stream and temporal correlation analysis to extract the corresponding marker center point. Subsequently, the method achieves rapid calibration by solving the Kruppa equations in conjunction with a parameter optimization framework. Finally, by employing a unified coordinate transformation and linear intersection, the method enables the measurement of 3D deformation of the target structure.
Results: Experiments confirmed that the relative measurement error is below 0.08%. Field experiments under extreme illumination conditions, including self-calibration of a multi-event camera array and 3D deformation measurement, verified the performance of the proposed method.
Conclusions: This paper addressed the critical limitation of traditional cameras in measuring high-speed 3D deformations under extreme illumination conditions. The experimental results demonstrate that, compared to other methods, the proposed method can accurately measure 3D deformations of structures under harsh lighting conditions, and the relative error of the measured deformation is less than 0.1%.

Keywords: Event cameras, 3D deformation measurement, Extreme illumination conditions, High-speed monitoring, Multi-camera array, Kruppa equations, Temporal correlation analysis, Structural health monitoring

234. ❌ ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models

Authors: Yuhuan Xie, Aoxuan Pan, Yi-Hua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Qi Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

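Scoring rationale: ObjectMorpher focuses on 3D-aware image editing, achieving precise object-level control through deformable 3D Gaussian Splatting models. Its core contribution is lifting 2D edits into geometry-grounded operations, involving 3D reconstruction, non-rigid deformation, and image compositing. All scoring keywords center on large language models (LLMs) and related techniques (such as training methods, inference optimization, alignment, and agent systems) or on specific scientific AI applications (such as bioinformatics). The paper's subject matter (3D computer vision, image editing, geometry processing) is entirely different from these keyword topics and has no direct connection to them, so every keyword has a relevance of 0.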

!!! tip deepseek-chat TL;DR

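The paper proposes ObjectMorpher, a framework that converts 2D image edits into geometric operations on deformable 3D Gaussian Splatting models. This addresses the imprecise or implausible edits produced by existing methods that lack 3D awareness and enables fast, identity-preserving, controllable photorealistic editing.

Abstract translation (Chinese)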

在图像编辑中实现精确的物体级控制仍具挑战性：二维方法缺乏三维感知能力，常产生模糊或不合理的结果，而现有的三维感知方法则依赖于繁重的优化或不完整的单目重建。我们提出了ObjectMorpher，一个统一的交互式框架，可将模糊的二维编辑转化为基于几何结构的操作。ObjectMorpher通过图像到三维生成器将目标实例提升为可编辑的三维高斯泼溅（3D Gaussian Splatting, 3DGS），从而实现快速且保持物体特性的操控。用户拖动控制点；基于图结构的非刚性变形结合尽可能刚性（as-rigid-as-possible, ARAP）约束，确保形状和姿态变化符合物理合理性。复合扩散模块协调光照、色彩和边界，实现无缝的重新融合。在多种类别中，ObjectMorpher提供了精细、逼真的编辑效果，在可控性和效率上表现优异，在KID、LPIPS、SIFID指标及用户偏好评估中均优于二维拖拽和三维感知基线方法。

Abstract

Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.

Keywords: 3D-aware image editing, 3D Gaussian Splatting, deformable models, object-level control, non-rigid deformation, photorealistic editing, geometry-grounded operations, interactive framework

235. ❌ BlankSkip: Early-exit Object Detection onboard Nano-drones

Authors: Carlo Marra, Beatrice Alessandra Motetti, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
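As a rough sketch of how the verdict above follows from the table: the report's configuration section gives the passing score as 5.0 + 0.8 × total weight, with a total weight of 27.0. The helper names below are hypothetical, and two things are assumptions because the report does not show its scoring code: that the paper's total is the plain sum of the Score column, and that passing means total ≥ threshold.

```python
def passing_threshold(total_weight, base=5.0, per_weight=0.8):
    """Passing score as stated in the report configuration: 5.0 + 0.8 x total weight."""
    return base + per_weight * total_weight


def evaluate(row_scores, total_weight):
    """Return (total, passed). Summing the Score column is an assumption."""
    total = sum(row_scores)
    return total, total >= passing_threshold(total_weight)


# Every row in the table above lists a score of 0.0 (27 keywords).
total, passed = evaluate([0.0] * 27, total_weight=27.0)
print(f"{total:.1f} / {passing_threshold(27.0):.1f} -> {'pass' if passed else 'fail'}")
```

Under these assumptions the threshold evaluates to 26.6, matching the figure shown next to each paper's score.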

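Rationale: The paper deploys lightweight computer-vision DNNs on nano-drones for object detection, using an early-exit mechanism to reduce inference latency. It is unrelated to most of the large-model keywords, but has some relevance to 'Small Language Models OR SLMs OR On-device AI' (5 points) because it involves edge-device deployment; some relevance to 'Quantization OR Model Compression OR Low-bit Weights' (5 points) because it involves model optimization under tight resource constraints; and some relevance to 'Speculative Decoding OR Inference Acceleration' (5 points) because the early-exit mechanism aims to accelerate inference. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes BlankSkip, an adaptive early-exit object detection network deployed on nano-drones, achieving up to 24% higher average throughput with a limited accuracy loss (0.015 mAP).

Abstract (Chinese translation)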

在纳米级无人机上部署微型计算机视觉深度神经网络是实现自主飞行的关键，但其计算平台的极端严苛限制（约10 MiB内存、1 W功耗预算）使这一过程变得复杂。针对“易于处理”的输入帧减少计算量的早期退出自适应深度神经网络，是降低平均推理延迟的有效途径。然而，尽管该方法在分类任务中已得到广泛研究，但其在目标检测等密集预测任务中的应用并不直接。本文提出BlankSkip——一种面向设备端目标检测的自适应网络，其利用简单的辅助分类任务实现早期退出，即识别不含感兴趣目标的空帧。通过在真实纳米无人机平台（Bitcraze Crazyflie 2.1）上进行的实验，我们在先进纳米无人机目标检测数据集上，相比静态MobileNet-SSD检测器，在平均精度均值仅下降0.015的情况下，实现了高达24%的平均吞吐量提升。

Abstract

Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for “easy-to-process” input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.

Keywords: early-exit, object detection, nano-drones, on-device AI, adaptive DNNs, inference acceleration, MobileNet-SSD, computational constraints
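The early-exit idea in the abstract (skip the detector when an auxiliary classifier judges the frame to be blank) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `early_exit_detect`, the 0.9 threshold, and both toy stand-in models are hypothetical, and a real system would run a cheap classifier head on shared backbone features rather than on raw pixels.

```python
from typing import Callable, List, Sequence, Tuple

Detection = Tuple[str, float]  # (label, confidence); illustrative type only


def early_exit_detect(
    frame: Sequence[float],
    blank_score: Callable[[Sequence[float]], float],
    full_detector: Callable[[Sequence[float]], List[Detection]],
    threshold: float = 0.9,
) -> Tuple[List[Detection], bool]:
    """Return (detections, exited_early) for one frame."""
    # Cheap auxiliary task: how confident are we that the frame is blank?
    if blank_score(frame) >= threshold:
        return [], True  # early exit: the expensive detector never runs
    return full_detector(frame), False


def toy_blank_score(frame: Sequence[float]) -> float:
    """Toy stand-in: fraction of near-zero pixels."""
    return sum(1 for p in frame if p < 0.1) / len(frame)


def toy_detector(frame: Sequence[float]) -> List[Detection]:
    """Toy stand-in: report one object with the peak intensity as confidence."""
    return [("object", max(frame))]


frames = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.8, 0.0, 0.0], [0.05, 0.0, 0.02, 0.0]]
results = [early_exit_detect(f, toy_blank_score, toy_detector) for f in frames]
skipped = sum(1 for _, exited in results if exited)
print(f"skipped the full detector on {skipped} of {len(frames)} frames")
```

The throughput gain in such a scheme depends on how often frames are blank and on how cheap the auxiliary classifier is relative to the full detector.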

236. ❌ Intelligent Road Condition Monitoring using 3D In-Air SONAR Sensing

Authors: Amber Cassimon, Robin Kerstens, Walter Daems, Jan Steckel Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

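Rationale: The paper studies road condition monitoring with 3D SONAR sensors, covering road material classification and damage detection. All scoring keywords concern large models, the principles of deep learning techniques, or AI applications in science, whereas this paper focuses on sensor data processing and conventional machine learning and computer vision tasks. It involves no large-model technique, deep learning innovation, or concrete AI-for-Science application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies road condition monitoring (road material classification and damage detection) with 3D SONAR sensors. SONAR performs well on material classification (F1 near 90%) but less well on damage detection (F1 around 75%), making it a promising sensing modality for pavement management systems that still needs further research.

Abstract (Chinese translation)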

本文研究了空中三维声纳传感器在监测路面状况方面的应用能力。具体而言，我们探讨了两项应用：路面材料分类与路面损伤检测及分类。尽管此类任务可通过其他传感器模态（如相机传感器与激光雷达传感器）完成，但这些传感器在恶劣感知条件下（如暴雨、烟雾或浓雾）往往失效。通过采用对此类干扰具有鲁棒性的感知模态，我们得以构建机会性感知应用，使执行其他任务（垃圾收集、邮件投递等）的车辆也能用于监测道路状况。针对这些任务，我们使用单一数据集，其中标注了不同类型的损伤，标签包含路面材料信息。在材料分类任务中，我们区分了三种不同路面材料：沥青路面、混凝土路面与构件路面。在损伤检测与分类任务中，我们判定是否存在损伤及损伤类型（与材料类型无关），但不进行损伤定位。实验表明，我们成功从声纳传感器数据中识别出路面类型，测试集F1分数接近90%；但发现损伤检测性能相对滞后，F1分数约为75%。由此我们得出结论：声纳感知是一种具有潜力的感知模态，可纳入基于机会性感知的路面管理系统，但需进一步研究以达到预期精度。

Abstract

In this paper, we investigate the capabilities of in-air 3D SONAR sensors for the monitoring of road surface conditions. Concretely, we consider two applications: Road material classification and Road damage detection and classification. While such tasks can be performed with other sensor modalities, such as camera sensors and LiDAR sensors, these sensor modalities tend to fail in harsh sensing conditions, such as heavy rain, smoke or fog. By using a sensing modality that is robust to such interference, we enable the creation of opportunistic sensing applications, where vehicles performing other tasks (garbage collection, mail delivery, etc.) can also be used to monitor the condition of the road. For these tasks, we use a single dataset, in which different types of damages are annotated, with labels including the material of the road surface. In the material classification task, we differentiate between three different road materials: Asphalt, Concrete and Element roads. In the damage detection and classification task, we determine if there is damage, and what type of damage (independent of material type), without localizing the damage. We are successful in determining the road surface type from SONAR sensor data, with F1 scores approaching 90% on the test set, but find that for the detection of damages performance lags, with F1 score around 75%. From this, we conclude that SONAR sensing is a promising modality to include in opportunistic sensing-based pavement management systems, but that further research is needed to reach the desired accuracy.

Keywords: 3D SONAR sensing, road condition monitoring, road material classification, damage detection, opportunistic sensing, pavement management, sensor data analysis, F1 score

237. ❌ Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence

Authors: Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28134v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

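Rationale: The paper focuses on Remote Sensing Image-Text Retrieval (RSITR) and proposes a robust method for handling Noisy Correspondence. The work involves multi-modal learning, a self-paced learning strategy, and a robust triplet loss, and sits at the intersection of computer vision and natural language processing. All keywords relate directly to large models (LLMs) or the principles of deep learning techniques, whereas this paper does not involve large-model techniques and does not apply large models to a specific scientific field such as bioinformatics or cheminformatics. Only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance), because remote sensing can be viewed as a scientific application area in the broad sense (earth science), although the paper does not explicitly apply large models or novel deep learning techniques to that field. The remaining keywords are unrelated to large-model architecture, training, inference, alignment, agents, and similar techniques, and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses noisy correspondence in remote sensing image-text retrieval with a robust retrieval paradigm based on self-paced learning and a robust triplet loss, significantly improving performance on several benchmark datasets, especially at high noise rates.

Abstract (Chinese translation)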

作为连接遥感视觉与语言理解的关键任务，遥感图像-文本检索近年来引起了广泛的研究关注。然而，现有的RSITR方法几乎都隐式地假设图像-文本对是完美匹配的。在实际应用中，获取大规模精确对齐的数据对往往成本极高甚至不可行。此外，我们注意到遥感数据集（如RSITMD）中确实存在一些不准确或失配的图像文本描述。基于以上观察，我们揭示了RSITR中一个重要但尚未被探索的问题，即噪声对应问题。为应对这些挑战，本文提出了一种新颖的鲁棒遥感图像-文本检索范式，该范式通过设计自步学习策略来模拟人类认知学习模式，从而实现对含噪声对应多模态数据从易到难的学习。具体而言，我们首先根据每个训练样本对的损失大小将其划分为三类：干净样本对、模糊样本对和噪声样本对。随后，通过基于损失值为每个样本对分配权重，分别估计各训练对的可靠性。进一步，我们分别设计了新的多模态自步函数，以动态调控样本的训练顺序和权重，从而建立渐进式学习过程。最后，针对噪声样本对，我们提出了一种鲁棒三元组损失函数，能够根据语义相似度动态调整软间隔，从而增强模型对噪声的鲁棒性。在三个主流基准数据集上的大量实验表明，所提出的RRSITR方法显著优于现有最先进方法，尤其在较高噪声率场景下表现突出。代码已开源：https://github.com/MSFLabX/RRSITR

Abstract

As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that the remote sensing datasets (e.g., RSITMD) truly contain some inaccurate or mismatched image text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we respectively estimate the reliability of each training pair by assigning a weight to each pair based on the values of the loss. Further, we respectively design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present a robust triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially in high noise rates. The code is available at: https://github.com/MSFLabX/RRSITR

Keywords: Remote Sensing Image-Text Retrieval, Noisy Correspondence, Self-paced Learning, Multi-modal Learning, Robust Triplet Loss, RSITR, RRSITR, Image-Text Matching
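The abstract describes three steps: splitting pairs by loss magnitude, weighting each pair by its loss, and using a triplet loss whose soft margin depends on semantic similarity. The sketch below illustrates one plausible shape for each step. It is not the paper's formulation: the two loss cut-offs, the linear self-paced weight, and the way the margin scales with similarity are all assumptions chosen for illustration, and every function name is hypothetical.

```python
def partition_by_loss(losses, low, high):
    """Split pair indices into clean / ambiguous / noisy by per-pair loss."""
    groups = {"clean": [], "ambiguous": [], "noisy": []}
    for index, loss in enumerate(losses):
        if loss < low:
            groups["clean"].append(index)
        elif loss < high:
            groups["ambiguous"].append(index)
        else:
            groups["noisy"].append(index)
    return groups


def self_paced_weight(loss, pace):
    """Linear soft weight: 1 at zero loss, falling to 0 once loss reaches `pace`.

    Raising `pace` over training admits harder pairs (easy-to-hard ordering).
    """
    return max(0.0, 1.0 - loss / pace)


def soft_margin_triplet(sim_pos, sim_neg, base_margin, pair_similarity):
    """Hinge triplet loss whose margin shrinks for weakly matched pairs.

    `pair_similarity` in [0, 1] is an assumed estimate of how well the
    image and text of the positive pair actually correspond.
    """
    margin = base_margin * max(0.0, min(1.0, pair_similarity))
    return max(0.0, margin - sim_pos + sim_neg)


groups = partition_by_loss([0.1, 0.5, 1.2], low=0.3, high=1.0)
print(groups)
print(self_paced_weight(0.5, pace=1.0))
print(soft_margin_triplet(0.8, 0.7, base_margin=0.2, pair_similarity=0.0))
```

With a similarity of 0 the margin vanishes, so a likely mismatched pair contributes no gradient as long as the positive already scores above the negative.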

238. ❌ SVGS: Single-View to 3D Object Editing via Gaussian Splatting

Authors: Pengcheng Xue, Yan Tian, Qiutao Song, Ziyi Wang, Linyang He, Weiping Ding, Mahmoud Hassaballah, Karen Egiazarian, Wei-Fa Yang, Leszek Rutkowski Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

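Rationale: The paper focuses on 3D scene editing and proposes SVGS, a single-view, text-driven editing method based on 3D Gaussian Splatting, to address the inefficiency and cross-view inconsistency of existing methods (such as NeRF-based and multi-view editing approaches). Its core is computer vision and 3D reconstruction rather than innovation in large language models or deep learning principles. All keywords are unrelated to the paper except 'AI for Science OR Bioinformatics OR Cheminformatics', which has some relevance to AI applications in science because 3D editing can be viewed as a potential application of AI in scientific visualization or related areas. The paper does not explicitly involve bioinformatics or cheminformatics, so the keyword receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes SVGS, a single-view, text-driven 3D object editing method based on 3D Gaussian Splatting, which addresses the limited editing efficiency and view consistency of existing methods. Experiments show that it outperforms baselines in editing capability and processing speed.

Abstract (Chinese translation)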

基于文本驱动的三维场景编辑因其便捷性与用户友好性吸引了广泛关注。然而，依赖隐式三维表征（如神经辐射场，NeRF）的方法虽然在渲染复杂场景方面表现有效，却受限于处理速度缓慢以及对场景特定区域的控制能力不足。此外，现有方法（包括Instruct-NeRF2NeRF与GaussianEditor）采用多视角编辑策略，在执行文本指令时经常在不同视角间产生不一致的结果。这种不一致性可能对模型的整体性能产生不利影响，使得在编辑结果的一致性与编辑效率之间取得平衡变得复杂。为应对这些挑战，我们提出了一种名为“基于高斯泼溅的单视角至三维物体编辑”（SVGS）的新方法，这是一种基于三维高斯泼溅（3DGS）的单视角文本驱动编辑技术。具体而言，针对文本指令，我们引入了一种基于多视角扩散模型的单视角编辑策略，该策略仅利用那些能够产生一致编辑结果的视角来重建三维场景。此外，我们采用稀疏三维高斯泼溅作为三维表征，从而显著提升了编辑效率。我们在多种场景设置下将SVGS与现有基线方法进行了对比分析，结果表明SVGS在编辑能力和处理速度上均优于同类方法，代表了三维编辑技术的重要进展。更多细节请访问我们的项目页面：https://amateurc.github.io/svgs.github.io。

Abstract

Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: https://amateurc.github.io/svgs.github.io.

Keywords: 3D scene editing, text-driven editing, 3D Gaussian Splatting, single-view editing, multi-view diffusion models, editing efficiency, view consistency, Neural Radiance Fields

239. ❌ MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

Authors: Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang, Wanran Sun, Zhenyu Zhang, Bing Ji, Qicheng Lao Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

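Rationale: The paper focuses on medical visual grounding and proposes MedLoc-R1, a performance-aware reward scheduling framework based on reinforcement learning (GRPO), to address reward sparsity and training instability on medical images. All keywords relate directly to large-model (LLM) techniques, training methods, inference optimization, agent systems, and so on, whereas the core of this paper is a reinforcement learning application in computer vision that does not involve large-model techniques, training paradigms, or related optimization methods. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is applied to the medical domain (medical visual grounding), which counts as AI applied to science (medicine), but this is not its core innovation (which lies in the RL reward scheduling method), so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets the reward sparsity and training instability of GRPO-based reinforcement learning in medical visual grounding. It proposes MedLoc-R1, which uses performance-aware curriculum reward scheduling to adjust the reward criterion automatically, improving localization accuracy and training stability.

Abstract (Chinese translation)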

医学视觉定位是细粒度多模态推理与可解释临床决策支持的关键基础。尽管强化学习在定位任务中已取得进展，但现有方法如组相对策略优化在直接应用于医学图像时，仍面临严重的奖励稀疏性问题。这主要源于定位微小或模糊感兴趣区域的内在困难，而强化学习中基于固定交并比的奖励机制因其僵化性与次优性进一步加剧了该问题，导致策略梯度消失与优化停滞，尤其在训练早期阶段更为显著。为解决这一挑战，我们提出MedLoc-R1，一种性能感知的奖励调度框架，能够根据模型就绪状态逐步收紧奖励标准。该框架引入滑动窗口性能追踪器与多条件更新规则，可自动将奖励机制从密集易得的信号调整为更严格、细粒度的定位要求，同时保留组相对策略优化的优良特性，无需引入辅助网络或额外梯度路径。在三个医学视觉定位基准测试上的实验表明，MedLoc-R1相较于基于组相对策略优化的基线方法，持续提升了定位精度与训练稳定性。本框架为高风险医疗应用中基于强化学习的定位任务提供了一种通用、轻量且有效的解决方案。代码与模型检查点已发布于 https://github.com/MembrAI/MedLoc-R1 。

Abstract

Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.

Keywords: Medical visual grounding, Reinforcement learning, GRPO, Reward scheduling, Curriculum learning, Performance-aware, Localization accuracy, Training stability
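To make the scheduling idea in the abstract concrete, here is a minimal sketch of a sliding-window tracker that tightens an IoU reward criterion once recent performance is good enough. It is an illustration under stated assumptions, not the MedLoc-R1 rule: the class name, the binary reward, the starting threshold, the step size, and the three update conditions are all hypothetical, since the abstract does not specify them.

```python
from collections import deque


class RewardScheduler:
    """Hypothetical curriculum scheduler for an IoU-thresholded binary reward."""

    def __init__(self, iou_threshold=0.3, step=0.1, max_threshold=0.7,
                 window=4, promote_rate=0.75):
        self.iou_threshold = iou_threshold
        self.step = step
        self.max_threshold = max_threshold
        self.promote_rate = promote_rate
        self.hits = deque(maxlen=window)  # sliding window of recent successes

    def reward(self, iou):
        """Score one prediction against the current criterion, then update it."""
        hit = iou >= self.iou_threshold
        self.hits.append(hit)
        self._maybe_tighten()
        return 1.0 if hit else 0.0

    def _maybe_tighten(self):
        # Condition 1: the window holds enough evidence.
        if len(self.hits) < self.hits.maxlen:
            return
        # Condition 2: the recent hit rate is high enough.
        hit_rate = sum(self.hits) / len(self.hits)
        # Condition 3: there is still room to tighten.
        if hit_rate >= self.promote_rate and self.iou_threshold < self.max_threshold:
            self.iou_threshold = min(self.max_threshold,
                                     round(self.iou_threshold + self.step, 10))
            self.hits.clear()  # start a fresh window at the stricter level


scheduler = RewardScheduler()
rewards = [scheduler.reward(iou) for iou in (0.5, 0.5, 0.5, 0.5, 0.35)]
print(rewards, scheduler.iou_threshold)
```

Because the schedule only changes how the scalar reward is computed, a scheme of this shape adds no auxiliary network or extra gradient path, which is the property the abstract emphasizes.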

240. ❌ $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning

Authors: Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng, Wei Gao Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28116v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

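Rationale: The paper proposes the AutoDrive-P3 framework, which integrates perception, prediction, and planning through structured reasoning; its core innovations are the P3-CoT dataset and the P3-GRPO hierarchical reinforcement learning algorithm. Keyword relevance: 1) Highly relevant (8-10 points): the paper explicitly uses chain-of-thought (CoT) reasoning as its core method and introduces dual detailed-thinking and fast-thinking modes, which relates to System 2 Thinking; it builds on vision-language models (VLMs), an application of large models to autonomous driving. 2) Moderately relevant (5 points): the paper involves the decision-making pipeline of an autonomous driving agent, which has some relevance to LLM Agents; it emphasizes interpretability, which relates to Explainable AI. 3) Unrelated (0 points): other keywords such as MoE, quantization, and RAG are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of chain-of-thought reasoning and the fragmented modules in existing VLM-based autonomous driving systems. It proposes the AutoDrive-P3 framework, which achieves coherent perception-prediction-planning reasoning through the P3-CoT dataset and the P3-GRPO hierarchical reinforcement learning algorithm, reaching state-of-the-art planning performance on several benchmarks.

Abstract (Chinese translation)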

视觉语言模型因其在处理长尾场景中的卓越性能，正日益被应用于端到端自动驾驶系统。然而，当前基于视觉语言模型的方法存在两大主要局限：1）部分视觉语言模型直接输出规划结果，缺乏思维链推理，绕过了关键的感知与预测阶段，这造成了显著的领域差距并损害了决策能力；2）另一些视觉语言模型虽能生成感知、预测和规划任务的输出，但采用了割裂的决策方式，各模块独立运行，导致严重缺乏协同性，从而削弱了真实的规划性能。为应对这些局限，我们提出了 ${AutoDrive\text{-}P^3}$，这是一个通过结构化推理无缝整合 $\textbf{感知}$、$\textbf{预测}$ 与 $\textbf{规划}$ 的新型框架。我们引入了 ${P^3\text{-}CoT}$ 数据集以促进连贯推理，并提出了 ${P^3\text{-}GRPO}$，一种分层强化学习算法，为所有三项任务提供渐进式监督。具体而言，${AutoDrive\text{-}P^3}$ 逐步生成针对感知、预测和规划的思维链推理与答案，其中感知为后续的预测与规划提供关键信息，而感知与预测共同作用于最终的规划决策，从而实现更安全、更具可解释性的自动驾驶。此外，为平衡推理效率与性能，我们引入了双重思维模式：详细思维与快速思维。在开环（nuScenes）与闭环（NAVSIMv1/v2）基准测试上的大量实验表明，我们的方法在规划任务中达到了最先进的性能。代码发布于 https://github.com/haha-yuki-haha/AutoDrive-P3。

Abstract

Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\textbf{P}$erception, $\textbf{P}$rediction, and $\textbf{P}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3.

Keywords: Vision-Language Models, Autonomous Driving, Chain-of-Thought Reasoning, Perception-Prediction-Planning, Reinforcement Learning, Hierarchical Supervision, Interpretable AI, End-to-End Systems

241. ❌ RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression

Authors: Chunhang Zheng, Tongda Xu, Mingli Xie, Yan Wang, Dou Li Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

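Scoring rationale: The paper focuses on lossless compression of raw images in computer vision and proposes a learned, bit-depth-adaptive compression framework. All scoring keywords concern large language models, deep learning principles, or AI-for-science applications, whereas the paper's subject (raw image compression) has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the storage burden of large Bayer-pattern raw image files by proposing RAWIC, a bit-depth-adaptive learned lossless compression framework that achieves higher compression efficiency than traditional codecs.

Abstract translation (Chinese)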

原始图像保留了线性传感器测量数据和高位深度信息，这对高级视觉任务和摄影应用至关重要，但由于文件体积大、位深度多变且依赖传感器特性，其存储仍面临挑战。现有的学习型无损压缩方法主要针对8位sRGB图像，而原始图像重建方法本质上是有限损的，且依赖于相机特定的假设。为解决这些问题，我们提出了RAWIC，一种适用于拜耳模式原始图像的位深度自适应学习型无损压缩框架。我们首先将单通道拜耳数据转换为四通道RGGB格式并将其分割为图像块。针对每个图像块，我们计算其位深度并将其作为辅助输入来指导压缩过程。随后设计了一个位深度自适应熵模型，用于在给定位深度条件下估计图像块的分布。该架构使得单一模型能够处理来自不同相机和位深度的原始图像。实验表明，RAWIC始终优于传统无损编解码器，相比JPEG-XL平均实现了7.7%的码率降低。我们的代码公开于https://github.com/chunbaobao/RAWIC。

Abstract

Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at https://github.com/chunbaobao/RAWIC.

Keywords: Raw Image Compression, Lossless Compression, Bit-depth Adaptive, Bayer Pattern, Learned Compression, Entropy Model, JPEG-XL, Image Storage
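
The preprocessing steps named in the RAWIC abstract (packing the Bayer mosaic into four RGGB planes, splitting it into patches, and computing a bit depth per patch) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names, the RGGB site layout starting at the top-left pixel, and the rule "bit depth = number of bits needed for the largest sample in the patch" are all assumptions, since the abstract does not specify them.

```python
def bayer_to_rggb(bayer):
    """Split a single-channel Bayer mosaic (list of rows) into R, G1, G2, B planes.

    Assumes an RGGB layout: R at (even row, even col), B at (odd row, odd col).
    """
    h, w = len(bayer), len(bayer[0])
    if h % 2 or w % 2:
        raise ValueError("Bayer mosaic must have even height and width")
    r = [row[0::2] for row in bayer[0::2]]
    g1 = [row[1::2] for row in bayer[0::2]]
    g2 = [row[0::2] for row in bayer[1::2]]
    b = [row[1::2] for row in bayer[1::2]]
    return [r, g1, g2, b]


def split_patches(planes, size):
    """Cut the four planes into size x size patches (edge patches may be smaller)."""
    h, w = len(planes[0]), len(planes[0][0])
    patches = []
    for top in range(0, h, size):
        for left in range(0, w, size):
            patches.append([[row[left:left + size] for row in plane[top:top + size]]
                            for plane in planes])
    return patches


def patch_bit_depth(patch):
    """Hypothetical rule: bits needed to represent the largest sample in the patch."""
    peak = max(v for plane in patch for row in plane for v in row)
    return max(1, peak.bit_length())


# Toy 4x4 mosaic of non-negative integer sensor values.
mosaic = [
    [1000, 20, 3000, 40],
    [21, 5, 41, 9],
    [70, 80, 90, 100],
    [81, 6, 101, 7],
]
planes = bayer_to_rggb(mosaic)
depths = [patch_bit_depth(p) for p in split_patches(planes, size=1)]
print(depths)  # [10, 12, 7, 7]
```

In a learned codec of this kind, the per-patch depth would be passed to the entropy model as side information; how RAWIC conditions on it is not described in the abstract.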

242. ❌ Octree-based Learned Point Cloud Geometry Compression: A Lossy Perspective

Authors: Kaiyu Zheng, Wei Gao, Huiming Zheng Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28095v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

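Scoring rationale: The paper focuses on point cloud geometry compression, specifically octree-based lossy compression, including leaf-node compression and rate control. All scoring keywords concern large language models, deep learning principles, or AI-for-science applications, whereas this work studies point cloud compression in computer vision/graphics, a completely different technical area, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses octree-based lossy point cloud compression, proposing a leaf-node lossy compression method for object point clouds and a rate control method for LiDAR point clouds. Experiments show that the leaf-node method significantly outperforms the previous octree-based method on object point clouds, and that the rate control method achieves about 1% bit error on LiDAR point clouds without fine-tuning.

Abstract translation (Chinese)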

基于八叉树的上下文学习方法近期已成为点云压缩领域的主流技术，但其在有损压缩方面的潜力尚未得到充分探索。传统有损压缩范式采用无损八叉树表示并结合量化步长调整，可能因量化过程中大量点缺失而导致严重失真。为此，我们分析了不同点云的数据特性，并针对性提出了有损压缩方案。针对因量化步长调整而受损的物体点云，我们提出了一种新型叶节点有损压缩方法，通过对叶节点执行比特级编码与二进制预测实现有损压缩。对于激光雷达（LiDAR）点云，我们探索了可变码率方案，并提出一种简洁高效的码率控制方法。实验结果表明：所提出的叶节点有损压缩方法在物体点云上显著优于既往基于八叉树的方法；所提出的码率控制方法在激光雷达点云上无需微调即可实现约1%的比特误差控制。

Abstract

Octree-based context learning has recently become a leading method in point cloud compression. However, its potential on lossy compression remains undiscovered. The traditional lossy compression paradigm using lossless octree representation with quantization step adjustment may result in severe distortions due to massive missing points in quantization. Therefore, we analyze data characteristics of different point clouds and propose lossy approaches specifically. For object point clouds that suffer from quantization step adjustment, we propose a new leaf nodes lossy compression method, which achieves lossy compression by performing bit-wise coding and binary prediction on leaf nodes. For LiDAR point clouds, we explore variable rate approaches and propose a simple but effective rate control method. Experimental results demonstrate that the proposed leaf nodes lossy compression method significantly outperforms the previous octree-based method on object point clouds, and the proposed rate control method achieves about 1% bit error without finetuning on LiDAR point clouds.

Keywords: point cloud compression, octree-based compression, lossy compression, leaf nodes compression, rate control, LiDAR point clouds, object point clouds, binary prediction
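
The abstract contrasts the proposed methods with the traditional paradigm of a lossless octree combined with a coarser quantization step. The sketch below illustrates only that baseline, not the paper's leaf-node coding or rate control: it quantizes integer points, builds breadth-first 8-bit occupancy codes, and shows how a larger step merges points. The function names and the child-index bit order (x, y, z from most to least significant bit) are conventions chosen here for illustration.

```python
def quantize(points, step):
    """Snap integer points to a grid of the given step; merged points collapse to one voxel."""
    return sorted({(x // step, y // step, z // step) for x, y, z in points})


def occupancy_codes(voxels, depth):
    """Breadth-first 8-bit occupancy codes for voxels inside a cube of side 2**depth."""
    codes = []
    nodes = {(0, 0, 0): list(voxels)}  # node coordinate at the current level -> voxels inside
    for level in range(depth):
        shift = depth - level - 1  # bit that selects the child at this level
        next_nodes = {}
        for origin in sorted(nodes):
            code = 0
            for x, y, z in nodes[origin]:
                child = (((x >> shift) & 1) << 2) | (((y >> shift) & 1) << 1) | ((z >> shift) & 1)
                code |= 1 << child
                key = (x >> shift, y >> shift, z >> shift)
                next_nodes.setdefault(key, []).append((x, y, z))
            codes.append(code)
        nodes = next_nodes
    return codes


points = [(0, 0, 0), (1, 0, 0), (3, 3, 3), (7, 7, 7)]
fine = occupancy_codes(quantize(points, 1), depth=3)    # lossless octree over an 8^3 cube
coarse_voxels = quantize(points, 2)                     # step 2 merges (0,0,0) and (1,0,0)
coarse = occupancy_codes(coarse_voxels, depth=2)
print(len(points), len(coarse_voxels))  # 4 3
print(fine)    # [129, 129, 128, 17, 128, 128]
print(coarse)  # [129, 129, 128]
```

The coarser step halves the number of occupancy codes but also loses a point, which is the kind of distortion the abstract attributes to quantization step adjustment.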

243. ❌ SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting

Authors: Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegger Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

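Scoring rationale: The paper focuses on optimizing deep learning models for traffic motion forecasting, in particular a streaming inference framework, and does not involve large language models (LLMs), large-model techniques, or scientific applications. All keywords concern large models, deep learning principles, or AI for Science, whereas this paper is an application-specific study in computer vision/autonomous driving with no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes SHARP, a novel streaming motion forecasting framework that incrementally processes observation windows and uses instance-aware context streaming to counter the performance degradation caused by heterogeneous observation lengths in dynamic traffic environments. It achieves state-of-the-art streaming-inference performance on the Argoverse 2 multi-agent benchmark while maintaining minimal latency.

Abstract translation (Chinese)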

在动态交通环境中，运动预测模型必须具备持续准确估计未来轨迹的能力。基于流式处理的方法是一种前景广阔的解决方案，但尽管近期取得了进展，这些方法在面临异构观测长度时性能仍会下降。为解决这一问题，我们提出了一种新颖的流式运动预测框架，该框架明确关注动态演化的场景。我们的方法逐步处理输入的观测窗口，并利用实例感知的上下文流式传输机制，在推理步骤间持续维护并更新智能体的隐式表征。通过双重训练目标，进一步确保了模型在不同观测时间跨度下预测精度的一致性。在Argoverse 2、nuScenes和Argoverse 1数据集上的大量实验表明，我们的方法在动态场景条件下以及单智能体基准测试中均表现出强鲁棒性。我们的模型在Argoverse 2多智能体基准测试的流式推理任务中取得了最先进的性能，同时保持了极低的延迟，凸显了其在实际部署中的适用性。

Abstract

In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.

Keywords: motion forecasting, streaming inference, dynamic traffic environments, observation windows, instance-aware context streaming, latent agent representations, real-world deployment, state-of-the-art performance

244. ❌ To View Transform or Not to View Transform: NeRF-based Pre-training Perspective

Authors: Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28090v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

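Scoring rationale: The paper studies NeRF (neural radiance field) pre-training for 3D perception in autonomous driving and proposes a new NeRF-Resembled Point-based 3D detector (NeRP3D). It is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points), because the whole paper revolves around the NeRF-based pre-training paradigm and its use in 3D perception models. All other keywords are unrelated (0 points): the paper concentrates on specific computer vision and 3D perception techniques (NeRF, point cloud detection, scene reconstruction) and does not touch on large language models, reasoning, alignment, compression, AI for science, or similar topics.

!!! tip deepseek-chat TL;DR

The paper addresses the conflict of priors between NeRF pre-training and view transformation in 3D perception. It proposes NeRP3D, a continuous 3D representation learning framework that retains the pre-trained NeRF network and significantly improves scene reconstruction and detection performance on the nuScenes dataset.

Abstract translation (Chinese)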

神经辐射场（NeRFs）已成为以视觉为中心的自动驾驶领域重要的预训练范式，其以完全自监督的方式增强了对三维几何与外观的理解。为将基于NeRF的预训练应用于三维感知模型，现有方法通常直接将NeRF应用于通过视角变换获得的体素特征。然而，将NeRF与视角变换耦合会继承相互冲突的先验假设：视角变换强加了离散且刚性的表征，而辐射场则假设了连续且自适应的函数。当这些对立的假设被强行整合到单一流程中时，错位问题会表现为模糊且歧义的三维表征，最终限制对三维场景的理解。此外，用于预训练的NeRF网络在下游任务中被丢弃，导致通过NeRF增强的三维表征未能得到有效利用。本文提出一种新颖的类NeRF点云三维检测器，能够学习连续的三维表征，从而避免视角变换带来的先验错位。NeRP3D无论任务如何均保留预训练的NeRF网络，继承了连续三维表征学习的原则，为场景重建与检测任务带来更大潜力。在nuScenes数据集上的实验表明，所提方法显著超越了现有先进技术，不仅在预训练场景重建任务中表现优异，在下游检测任务中也取得更优性能。

Abstract

Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potentials for both scene reconstruction and detection tasks. Experiments on nuScenes dataset demonstrate that our proposed approach significantly improves previous state-of-the-art methods, outperforming not only pretext scene reconstruction tasks but also downstream detection tasks.

Keywords: Neural Radiance Fields (NeRFs), Pre-training, 3D Perception, Autonomous Driving, View Transformation, Continuous 3D Representation, Scene Reconstruction, 3D Detection
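
As background for the "continuous and adaptive functions" that radiance fields assume, the original NeRF formulation renders a ray by weighting density samples with accumulated transmittance: w_i = T_i (1 - exp(-sigma_i * delta_i)), where T_i = exp(-sum over j < i of sigma_j * delta_j). The sketch below implements only this standard quadrature. It is not NeRP3D's architecture, which the abstract does not detail, and the function name and sample values are illustrative.

```python
import math


def render_weights(densities, deltas):
    """Per-sample compositing weights along one ray (standard NeRF quadrature)."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# Three samples along a ray: thin fog, a dense surface, then more fog behind it.
weights = render_weights([0.5, 2.0, 0.1], [1.0, 1.0, 1.0])
print(weights)
print(sum(weights) <= 1.0)
```

Because the densities can be queried at any point along the ray, the representation is continuous, in contrast to the fixed voxel grid produced by view transformation.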

245. ❌ GEMS: Agent-Native Multimodal Generation with Memory and Skills

Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28088v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

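Scoring rationale: The paper proposes the GEMS framework, whose core is a multi-agent system that improves multimodal generation, which is highly relevant to 'LLM Agents' and 'Multi-agent Systems' (10 points each). The framework refines its output iteratively through an Agent Loop, which touches on 'Self-Correction' (5 points). It builds on foundation models (such as Z-Image-Turbo), which falls under 'Large Language Models' (8 points), and it mentions a lightweight 6B model, which relates somewhat to 'Small Language Models' (5 points). The system supports loading domain skills, which relates to 'Tool Use' (5 points). Other keywords, such as MoE, Scaling Laws, training methods, and inference acceleration, do not appear in the abstract and are rated 0.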
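
The arithmetic behind the score line can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) and the 15-point maximum per keyword come from the report configuration. The per-row rule `weight * relevance / 10 * 15` is an assumption made here for illustration, because the report does not state how a row score is derived. The rows as printed carry a weight of 0.0, so any rule that multiplies by the weight yields the reported 0.0.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration
BASE_SCORE = 5.0        # passing score = BASE_SCORE + WEIGHT_FACTOR * total weight
WEIGHT_FACTOR = 0.8


def passing_score(weights):
    """Passing threshold as given in the report configuration."""
    return BASE_SCORE + WEIGHT_FACTOR * sum(weights)


def paper_score(rows):
    """Sum of row scores. ASSUMPTION: row score = weight * relevance / 10 * 15."""
    return sum(weight * (relevance / 10.0) * MAX_PER_KEYWORD for weight, relevance in rows)


config_weights = [1.0] * 27  # 27 keywords, each with weight 1.0
# (weight, relevance) pairs as printed in the table above: six non-zero relevances.
gems_rows = [(0.0, 8.0), (0.0, 5.0), (0.0, 5.0),
             (0.0, 10.0), (0.0, 5.0), (0.0, 10.0)] + [(0.0, 0.0)] * 21

print(round(passing_score(config_weights), 1))  # 26.6
print(paper_score(gems_rows))                   # 0.0
```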

!!! tip deepseek-chat TL;DR

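The study addresses the limitations of existing multimodal generation models on complex instructions and downstream tasks. It proposes the GEMS framework, which uses a multi-agent architecture, a memory system, and a skill library to significantly improve generation performance, enabling the lightweight 6B model Z-Image-Turbo to surpass Nano Banana 2 on GenEval2.

Abstract translation (Chinese)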

近期多模态生成模型在通用生成任务上取得了显著进展，但在处理复杂指令和专业化下游任务时仍面临挑战。受Claude Code等先进智能体框架成功的启发，我们提出\textbf{GEMS}（具备记忆与技能的智能体原生多模态生成框架），该框架旨在突破基础模型在通用任务和下游任务上的固有局限。GEMS构建于三个核心组件之上：智能体循环引入结构化多智能体框架，通过闭环优化迭代提升生成质量；智能体记忆提供持久化的轨迹级记忆系统，分层存储事实状态与压缩的经验摘要，既能从全局视角审视优化过程，又能减少冗余；智能体技能提供可扩展的领域专业知识库，支持按需加载，使系统能够有效处理多样化的下游应用。在涵盖五项主流任务和四项下游任务的评测中，基于多种生成后端模型的实验表明，GEMS持续取得显著的性能提升。最引人注目的是，该框架使轻量级6B模型Z-Image-Turbo在GenEval2基准上超越了当前最先进的Nano Banana 2模型，这证明了智能体机制能够有效扩展模型能力，突破其原有边界。

Abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.

Keywords: multimodal generation, agent framework, multi-agent systems, memory systems, domain-specific skills, closed-loop optimization, downstream tasks, performance improvement

246. ❌ LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

Authors: Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28082v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

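Scoring rationale: The paper proposes LogiStory, a logic-aware framework for multi-image story visualization, whose core innovation is a multi-agent system that models visual logic through role grounding, causal chain extraction, and story consistency verification. This is highly relevant to 'Multi-agent Systems OR Agent Coordination' (10 points), because the paper explicitly builds a multi-agent system to achieve its goal. However, the paper mainly concerns visual sequence generation and story visualization and does not involve large-model principles, deep learning innovations, or AI applications in science, so it is unrelated to all other keywords (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the lack of logical flow when existing multimodal systems generate visual story sequences. It proposes LogiStory, a logic-aware multi-image story visualization framework that explicitly models visual logic with a multi-agent system and significantly improves the narrative logic and visual quality of generated visual stories.

Abstract translation (Chinese)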

生成连贯且具有叙事性的视觉序列（如图像序列与视频）仍是当前多模态系统面临的重要挑战。尽管在视觉质量与世界知识融合方面已取得进展，现有模型仍难以维持逻辑流畅性，常导致动作脱节、叙事碎片化及情节线模糊。我们将这些问题归因于对视觉逻辑关注的缺失——视觉逻辑是视觉序列生成中关键却尚未被充分探索的维度，我们将其定义为角色、动作与场景之间随时间推移形成的感知与因果连贯性。为弥补这一空白，我们提出一种逻辑感知的多图像故事可视化框架LogiStory。该框架围绕“在故事可视化中显式建模视觉逻辑”这一核心创新构建。为实现这一理念，我们设计了一个多智能体系统，通过角色定位、因果链提取与故事级一致性验证，将叙事连贯性从图像生成的隐性副产品转变为显式建模目标。这一设计有效衔接了结构化故事规划与视觉生成过程，从而提升了故事可视化中的叙事清晰度与视觉质量。此外，为评估生成能力，我们构建了LogicTale基准数据集，其中包含富含标注、强调因果推理与视觉逻辑可解释性的故事样本。我们建立了全面的自动与人工评估方案，用以衡量视觉逻辑与感知质量。实验表明，我们的方法显著提升了生成视觉故事的叙事逻辑性。本工作为在通用图像序列与视频生成任务中建模并强化视觉逻辑奠定了重要基础。

Abstract

Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Keywords: multi-image story visualization, visual logic, multi-agent system, causal reasoning, narrative coherence, story consistency, visual sequence generation, LogicTale benchmark

247. ❌ AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

Authors: Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

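Scoring rationale: The paper studies the evaluation of visual-logical consistency in academic illustration generation and proposes the AIBench benchmark, which uses VQA to evaluate logical correctness and VLMs to evaluate aesthetic quality. Its content centers on image generation evaluation, multimodal understanding, and benchmarking, and is unrelated to the vast majority of keywords (which concern large-model principles, training methods, inference optimization, alignment techniques, and so on). It is only partly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): generating illustrations for academic papers is an application of AI in science, but not a core bioinformatics or cheminformatics application.

!!! tip deepseek-chat TL;DR

The paper proposes the AIBench benchmark for evaluating visual-logical consistency in academic illustration generation. It finds that the performance gap between current models on this task is much larger than on general tasks, and that logic and aesthetics are hard to optimize simultaneously.

Abstract translation (Chinese)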

尽管图像生成技术通过其快速发展推动了多种应用，但当前最先进的模型能否为学术论文生成可直接使用的插图，这一问题在很大程度上仍未得到探索。直接使用视觉语言模型（VLM）对插图进行比较或评估虽直观，但需要理想的多模态理解能力，这对于冗长复杂的文本和插图而言并不可靠。为解决此问题，我们提出了AIBench——首个利用视觉问答（VQA）评估学术插图逻辑正确性、并借助VLM评估其美学质量的基准。具体而言，我们从论文方法部分总结的逻辑图中提炼出四个层次的问题，用于查询生成的插图在不同尺度上与论文内容的一致性。我们基于VQA的方法能够对视觉-逻辑一致性进行更准确、细致的评估，同时减少对评判VLM能力的依赖。借助高质量构建的AIBench，我们进行了大量实验，结果表明：各模型在此任务上的性能差距显著大于通用生成任务，这反映了它们在复杂推理和高密度生成能力上的差异。此外，逻辑性与美学性难以像人工绘制的插图那样同时优化。进一步的实验表明，对这两种能力进行测试时扩展可显著提升模型在此任务上的表现。

Abstract

Although image generation has boosted various applications via its rapid evolution, whether the state-of-the-art models are able to produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with VLM is native but requires oracle multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating logic correctness of the academic illustrations and VLMs for assessing aesthetics. In detail, we designed four levels of questions proposed from a logic diagram summarized from the method part of the paper, which query whether the generated illustration aligns with the paper on different scales. Our VQA-based approach raises more accurate and detailed evaluations on visual-logical consistency while relying less on the ability of the judger VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than general ones, reflecting their various complex reasoning and high-density generation ability. Further, the logic and aesthetics are hard to optimize simultaneously as in handcrafted illustrations. Additional experiments further state that test-time scaling on both abilities significantly boosts the performance on this task.

Keywords: academic illustration generation, visual-logical consistency, benchmark evaluation, VQA-based assessment, multi-modal understanding, complex reasoning, aesthetics evaluation, test-time scaling

248. ❌ 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction

Authors: Renjie Wu, Hongdong Li, Jose M. Alvarez, Miaomiao Liu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

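Scoring rationale: The paper "4DSurf" focuses on dynamic scene surface reconstruction in computer vision and uses Gaussian Splatting to address geometric consistency. All scoring keywords concern large language models, deep learning principles, or AI applications in science, whereas this work studies a pure computer vision 3D reconstruction problem and does not involve large models, deep learning innovations, or AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes 4DSurf, a new framework for dynamic scene surface reconstruction that handles large deformations and temporal consistency through Gaussian-deformation-induced signed distance function flow regularization and an overlapping segment partitioning strategy. On the Hi4D and CMU Panoptic datasets it outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively.

Abstract translation (Chinese)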

本文针对基于高斯溅射（Gaussian Splatting, GS）的动态场景表面重建问题，旨在恢复时间一致的几何结构。现有基于GS的动态表面重建方法虽能实现高质量重建，但通常局限于单一物体或仅发生微小形变的物体，难以在长时间跨度下保持大形变表面的时间一致性重建。我们提出“4DSurf”——一种新颖且统一的通用动态表面重建框架，无需预先指定场景中物体的数量或类型，能够处理大范围的表面形变及重建过程中的时间不一致性。本框架的核心创新在于引入了高斯形变诱导的符号距离函数流正则化，该机制约束高斯点的运动以贴合演化中的表面。为处理大形变，我们提出重叠分段策略，将序列划分为具有小形变的重叠片段，并通过共享的重叠时间步长在片段间渐进传递几何信息。在两个具有挑战性的动态场景数据集（Hi4D和CMU Panoptic）上的实验表明，本方法在倒角距离（Chamfer distance）指标上分别优于当前最先进的表面重建方法49%和19%，并在稀疏视角设置下实现了更优的时间一致性。

摘要 (Abstract)

This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``\textit{4DSurf}’’, a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.

Keywords: dynamic scene surface reconstruction, Gaussian Splatting, temporal consistency, large deformations, 4DSurf, signed distance function, overlapping segment partitioning, sparse-view settings
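The Overlapping Segment Partitioning idea described in the 4DSurf abstract can be illustrated with a minimal Python sketch. It only shows how a frame sequence might be split so that consecutive segments share one timestep; the function name `overlapping_segments` and the `segment_len` parameter are hypothetical and not taken from the paper, and the paper's actual segmentation criterion (segments "with small deformations") is not modeled here.

```python
def overlapping_segments(num_frames, segment_len):
    """Split frame indices 0..num_frames-1 into consecutive segments.

    Each segment reuses the last frame of the previous segment as its first
    frame, so geometry could be handed over through that shared timestep.
    Fixed-length segments are an assumption made for illustration.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2 so segments can overlap")

    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)  # exclusive upper bound
        segments.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - 1  # the shared overlapping timestep
    return segments


if __name__ == "__main__":
    for seg in overlapping_segments(7, 3):
        print(seg)
```

With 7 frames and a segment length of 3, frames 2 and 4 each appear in two neighboring segments, which is where information would be passed along.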

249. ❌ Object Detection Based on Distributed Convolutional Neural Networks

Authors: Liang Sun Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

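Rationale: The paper studies an object detection method based on the Distributed Convolutional Neural Network (DisCNN). It belongs to conventional computer vision and focuses on CNN architecture and detection algorithms. The scoring keywords all concern large models, innovations in deep learning techniques, or applications of AI in science, whereas this paper does not touch on large language models (LLMs), MoE, scaling laws, pre-training or post-training, alignment, RAG, inference acceleration, hallucination mitigation, interpretability, or agents, nor on scientific AI applications such as bioinformatics. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an object detection method based on the Distributed Convolutional Neural Network (DisCNN). It detects features at multiple scales and overlaps the high-scoring patches to form bounding boxes, giving a lightweight detector that can run in parallel across classes.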

Abstract (Chinese translation)

基于分布式卷积神经网络（Distributed Convolutional Neural Network，DisCNN），本文提出了一种简洁的目标检测方法。DisCNN针对特定正类别的输出向量模长，与正类特征存在的概率呈正单调关系。因此，通过识别所有可能尺度上的高分区域，并将它们重叠以形成边界框，即可检测出正类目标。其核心思想在于，通过从特定子特征到由这些子特征构成的抽象特征的多尺度特征检测来实现目标检测。训练DisCNN仅需要以目标为中心的图像数据及正负类别标签。针对多个正类别的检测过程可并行执行以显著加速，同时由于其轻量级模型架构，单目标检测速度也更快。

Abstract

Based on the Distributed Convolutional Neural Network (DisCNN), a straightforward object detection method is proposed. The modulus of the output vector of a DisCNN with respect to a specific positive class is positively monotonic with the presence probability of the positive features. So, by identifying all high-scoring patches across all possible scales, the positive object can be detected by overlapping them to form a bounding box. The essential idea is that the object is detected by detecting its features on multiple scales, ranging from specific sub-features to abstract features composed of these sub-features. Training DisCNN requires only object-centered image data with positive and negative class labels. The detection process for multiple positive classes can be conducted in parallel to significantly accelerate it, and single-object detection is also faster because of the lightweight model architecture.

Keywords: Object Detection, Distributed Convolutional Neural Network, DisCNN, Multi-scale Features, Bounding Box, Lightweight Model, Parallel Detection, Training Data
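The detection procedure summarized in the DisCNN abstract (score patches at several scales, keep the high-scoring ones, overlap them into a box) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's implementation: `score_fn` stands in for the DisCNN output modulus, "overlapping" is interpreted as taking the union extent of the accepted patches, and the names `detect_box` and `toy_score` are hypothetical.

```python
def detect_box(score_fn, width, height, scales, stride, threshold):
    """Slide square patches of several sizes over an image grid.

    Patches whose score reaches `threshold` are kept, and the bounding box is
    taken as the union extent of those patches (an assumed interpretation).
    Returns (left, top, right, bottom) or None when nothing scores high enough.
    """
    kept = []
    for size in scales:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                if score_fn(left, top, size) >= threshold:
                    kept.append((left, top, left + size, top + size))
    if not kept:
        return None
    return (
        min(b[0] for b in kept),
        min(b[1] for b in kept),
        max(b[2] for b in kept),
        max(b[3] for b in kept),
    )


def toy_score(left, top, size, obj=(4, 4, 8, 8)):
    """Stand-in scorer: fraction of the patch covered by a known object box."""
    ox0, oy0, ox1, oy1 = obj
    w = max(0, min(left + size, ox1) - max(left, ox0))
    h = max(0, min(top + size, oy1) - max(top, oy0))
    return (w * h) / (size * size)


if __name__ == "__main__":
    print(detect_box(toy_score, 12, 12, scales=[2, 4], stride=2, threshold=1.0))
```

A real scorer would be a trained network evaluated per class, and the per-class loops are independent, which is what makes the parallel execution mentioned in the abstract possible.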

250. ❌ Event6D: Event-based Novel Object 6D Pose Tracking

Authors: Jae-Young Kang, Hoonhee Cho, Taeyeop Lee, Minjun Kang, Bowen Wen, Youngho Kim, Kuk-Jin Yoon Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

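Rationale: The paper focuses on a computer vision application of event cameras to 6D object pose tracking, covering event data processing, depth reconstruction, and pose tracking algorithms. It does not involve large language models, the deep learning techniques named in the keywords, or AI applications in science, so it is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes EventTrack6D, an event-camera-based 6D object pose tracking framework that generalizes to novel objects without object-specific training by reconstructing intensity and depth. Trained only on synthetic data, it generalizes to real-world scenes.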

Abstract (Chinese translation)

事件相机具备微秒级延迟特性，使其适用于快速动态场景中的六维物体姿态跟踪，而传统RGB与深度流程在此类场景中易受运动模糊和大像素位移的影响。本文提出EventTrack6D——一种事件-深度跟踪框架，该框架通过在深度帧之间的任意时间戳重建强度与深度信息，无需针对特定物体进行训练即可泛化至新物体。基于最新深度测量数据，我们的双重重建方法能够从稀疏事件流中恢复密集的光度与几何线索。EventTrack6D的运行速度超过120帧/秒，并在快速运动下保持时间一致性。为支持训练与评估，我们构建了一套综合基准测试集：包括用于训练的大规模合成数据集，以及两个互补的评估集（含真实与模拟事件数据集）。仅通过合成数据训练的EventTrack6D无需微调即可有效泛化至真实场景，在不同物体与运动模式下均保持精确跟踪。我们的方法与数据集验证了事件相机在基于事件的新物体六维姿态跟踪中的有效性。代码与数据集已公开于https://chohoonhee.github.io/Event6D。

Abstract

Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.

Keywords: event cameras, 6D pose tracking, novel objects, depth reconstruction, real-time tracking, synthetic dataset, generalization

251. ❌ Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

Authors: Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28049v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

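Rationale: The paper focuses on acceleration for visual generation models, specifically the autoregressive-diffusion hybrid paradigm. Its core contribution is Drift-AR, which uses an entropy signal to accelerate both the AR stage and the visual decoding stage. It is highly relevant only to 'Speculative Decoding OR Inference Acceleration' (10 points), because it explicitly proposes Entropy-Informed Speculative Decoding for AR acceleration and achieves inference acceleration through single-step decoding. The other keywords are unrelated to the paper, which does not cover large language models, alignment, fine-tuning, agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper addresses the dual speed bottleneck of autoregressive-diffusion hybrid visual generation models. It proposes Drift-AR, which uses an entropy signal to accelerate both the AR stage and the visual decoding stage, achieving a 3.8 to 5.5× speedup with single-step decoding while matching or surpassing the original quality.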

Abstract (Chinese translation)

自回归（AR）-扩散混合范式结合了AR的结构化语义建模能力与扩散模型的高保真合成特性，但其面临双重速度瓶颈：顺序执行的AR阶段与扩散视觉解码阶段的多步迭代去噪过程。现有方法往往孤立地处理这两个瓶颈，缺乏统一的设计原则。我们观察到，连续空间AR模型的逐位置预测熵天然编码了空间变化的生成不确定性，它同时主导AR阶段草稿预测的质量，并反映视觉解码阶段所需的修正强度，而这一特性此前尚未被充分探索。由于熵本质上与两个瓶颈均相关联，它可作为联合加速的自然统一信号。本文提出Drift-AR，利用熵信号加速两个阶段：1）针对AR加速，我们引入熵感知的推测解码，通过因果归一化熵损失对齐草稿与目标熵分布，解决因熵失配导致的草稿过度拒绝问题；2）针对视觉解码加速，我们将熵重新解释为反对称漂移场初始状态的物理方差（高熵位置激活朝向数据流形的强漂移，而低熵位置产生可忽略的漂移），从而实现无需迭代去噪或蒸馏的单步（1-NFE）解码。此外，两个阶段共享同一熵信号，该信号仅需计算一次且无额外开销。在MAR、TransDiff和NextStep-1上的实验表明，该方法在实现真实1-NFE解码的同时获得3.8–5.5倍的加速，且生成质量与原模型相当或更优。代码将在https://github.com/aSleepyTree/Drift-AR 公开。

Abstract

Autoregressive (AR)-Diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position prediction entropy of continuous-space AR models naturally encodes spatially varying generation uncertainty, simultaneously governing draft prediction quality in the AR stage and reflecting the corrective effort required by the vision decoding stage, a property that has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose Drift-AR, which leverages the entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding that aligns draft–target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the physical variance of the initial state for an anti-symmetric drifting field (high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift), enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once with no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8–5.5× speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at https://github.com/aSleepyTree/Drift-AR.

Keywords: Autoregressive-Diffusion hybrid, Visual generation acceleration, Entropy-informed speculative decoding, Single-step decoding, Anti-symmetric drifting, Prediction entropy, Inference acceleration, 1-NFE decoding
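For readers unfamiliar with the "draft rejection" that the Drift-AR abstract refers to, the sketch below shows the commonly used acceptance rule of speculative sampling for discrete tokens. It is not the paper's Entropy-Informed Speculative Decoding, which operates on continuous-space AR models; the function name `speculative_accept` and the list-based distributions are assumptions made purely for illustration.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """Accept or reject one drafted token (standard discrete-token rule).

    p_target and q_draft are probability lists over the same vocabulary.
    The draft token is accepted with probability min(1, p/q); on rejection a
    replacement is drawn from the normalized residual max(0, p - q).
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True

    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # No residual mass left: fall back to sampling from the target itself.
        residual, total = list(p_target), sum(p_target)

    r = rng.random() * total
    acc = 0.0
    for token, weight in enumerate(residual):
        acc += weight
        if r < acc:
            return token, False
    return len(residual) - 1, False


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_accept(1, [0.7, 0.3], [0.2, 0.8], rng))
```

The closer the draft distribution is to the target, the more often `min(1, p/q)` equals 1 and the fewer drafts are thrown away, which is the intuition behind aligning the two distributions.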

252. ❌ Effort-Based Criticality Metrics for Evaluating 3D Perception Errors in Autonomous Driving

Authors: Sharang Kaul, Simon Bultmann, Mario Berk, Abhinav Valada Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28029v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

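Rationale: The paper focuses on metrics for evaluating 3D perception errors in autonomous driving. It proposes effort-based criticality metrics (FSR, MDR, LEA) and uses reachability analysis for collision filtering. The content centers on perception system evaluation, safety metrics, and kinematic analysis, and does not involve large models, AI for science, or any of the keyword techniques. The scoring keywords concern large-model techniques, training methods, inference optimization, AI agents, and scientific AI applications, none of which relates directly to this paper's topic.

!!! tip deepseek-chat TL;DR

The paper addresses the evaluation of perception errors in autonomous driving and proposes three effort-based criticality metrics: False Speed Reduction (longitudinal speed loss), Maximum Deceleration Rate, and Lateral Evasion Acceleration. Using reachability analysis and track matching, the metrics identify safety-relevant perception failures that conventional time-based or deceleration-based metrics miss.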

Abstract (Chinese translation)

诸如碰撞时间（TTC）等关键性指标量化了碰撞紧迫性，但混淆了误报（FP）与漏报（FN）感知错误所导致的后果。我们提出了两种新颖的基于干预努力的度量指标：误报减速（FSR），即由持续存在的虚影检测导致的累积速度损失；以及最大减速率（MDR），即在恒定加速度模型下，因漏检物体而产生的峰值制动需求。这些纵向指标由横向规避加速度（LEA）进行补充，该指标改编自先前的横向规避运动学模型，并与基于可达性的碰撞时序计算相结合，以量化规避预测碰撞所需的最小转向努力。一个基于可达性的椭球碰撞过滤器确保仅对动态上合理的威胁进行评分，并采用帧级匹配与轨迹级聚合。在nuScenes和Argoverse 2数据集上对不同感知流程的评估表明，65-93%的感知误差是非关键性的；斯皮尔曼相关性分析证实，所有三项指标均能捕捉到现有基于时间、基于减速率或归一化关键性度量所无法获取的安全相关信息，从而能够有针对性地挖掘最关键的感知失效案例。

Abstract

Criticality metrics such as time-to-collision (TTC) quantify collision urgency but conflate the consequences of false-positive (FP) and false-negative (FN) perception errors. We propose two novel effort-based metrics: False Speed Reduction (FSR), the cumulative velocity loss from persistent phantom detections, and Maximum Deceleration Rate (MDR), the peak braking demand from missed objects under a constant-acceleration model. These longitudinal metrics are complemented by Lateral Evasion Acceleration (LEA), adapted from prior lateral evasion kinematics and coupled with reachability-based collision timing to quantify the minimum steering effort to avoid a predicted collision. A reachability-based ellipsoidal collision filter ensures only dynamically plausible threats are scored, with frame-level matching and track-level aggregation. Evaluation of different perception pipelines on nuScenes and Argoverse 2 shows that 65-93% of errors are non-critical, and Spearman correlation analysis confirms that all three metrics capture safety-relevant information inaccessible to established time-based, deceleration-based, or normalized criticality measures, enabling targeted mining of the most critical perception failures.

Keywords: autonomous driving, perception errors, criticality metrics, effort-based metrics, collision avoidance, reachability analysis, safety evaluation, 3D perception
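To make the contrast between a time-based and an effort-based criticality measure concrete, here is a minimal Python sketch using textbook constant-acceleration kinematics. It is not the paper's definition of MDR, FSR, or LEA: the one-dimensional setup, the function names, and the formula a = v² / (2d) for the deceleration needed to cancel a closing speed v within a gap d are all simplifying assumptions.

```python
def time_to_collision(gap_m, closing_speed_mps):
    """Time until contact if the closing speed stays constant (seconds)."""
    if closing_speed_mps <= 0:
        return float("inf")  # not closing in, so no collision is predicted
    return gap_m / closing_speed_mps


def required_deceleration(gap_m, closing_speed_mps):
    """Constant deceleration that removes the closing speed exactly at the gap.

    From v^2 = 2 * a * d, so a = v^2 / (2 * d), in m/s^2.
    """
    if closing_speed_mps <= 0:
        return 0.0
    if gap_m <= 0:
        return float("inf")
    return closing_speed_mps ** 2 / (2.0 * gap_m)


if __name__ == "__main__":
    # A missed object 20 m ahead while closing at 10 m/s.
    print(time_to_collision(20.0, 10.0), required_deceleration(20.0, 10.0))
```

Two situations can share the same TTC yet demand different braking effort, since the required deceleration grows with the square of the closing speed; that difference is what an effort-based view exposes.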

253. ❌ Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Authors: Arundhathi Dev, Justin Zhan Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

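Rationale: The paper studies domain adaptation for optical character recognition (OCR) and proposes a decoupled detection-and-correction framework that uses pretrained sequence models (T5, ByT5, BART) for linguistic correction. Relevance to the keywords: 1) it is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points), because its core is domain adaptation with pretrained models, in particular adaptation without labeled target images; 2) the other keywords (such as LLMs, MoE, and RLHF) are unrelated (0 points), because the paper concentrates on OCR and on applying sequence models rather than on large-model techniques or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper tackles the high computational cost of domain adaptation in optical character recognition. It proposes a decoupled detection-and-correction framework that uses pretrained sequence models for linguistic correction, cutting compute by about 95% while keeping near state-of-the-art accuracy.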

Abstract (Chinese translation)

光学字符识别仍是文档数字化的关键基础设施，但当前最先进的性能常因高昂的计算成本而局限于资源充足的机构。端到端Transformer架构虽能实现高精度，但需要数百GPU小时进行领域适配，限制了实践者与数字人文学者的可及性。本文提出一种模块化的检测-校正框架，通过单GPU训练即可实现接近最先进的精度。该方法将轻量级视觉字符检测（领域无关）与基于预训练序列模型（包括T5、ByT5和BART）的领域特定语言校正相解耦。通过完全在合成噪声数据上训练校正器，我们实现了免标注的领域适应，无需带标签的目标图像。在现代工整手写体、草书及历史文献上的评估揭示出架构选择中的关键“帕累托边界”：T5-Base在标准词汇的现代文本上表现卓越，而ByT5-Base凭借字节级重构古旧拼写的能力在历史文献中占据优势。实验结果表明，这种解耦范式在匹配端到端Transformer精度的同时，将计算需求降低约95%，为整体式OCR架构提供了可行且资源高效的替代方案。

Abstract

Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical “Pareto frontier” in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.

Keywords: Optical Character Recognition, Domain Adaptation, Decoupled Framework, Pretrained Sequence Models, Synthetic Noise Training, Resource Efficiency, T5, ByT5
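The abstract above says the correctors are trained "entirely on synthetic noise". One plausible way to build such training pairs is sketched below: clean text lines are corrupted with random character substitutions, deletions, and insertions, and each (noisy, clean) pair becomes a sequence-to-sequence example. The noise model, the error rate, and the names `corrupt` and `make_pairs` are assumptions for illustration; the paper's actual noise generation is not described in this report.

```python
import random
import string


def corrupt(text, error_rate, rng, alphabet=string.ascii_lowercase):
    """Inject character-level noise that loosely mimics recognition errors."""
    out = []
    for ch in text:
        if rng.random() < error_rate:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(alphabet))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(alphabet))
            # "delete": the character is simply dropped
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(lines, error_rate, seed=0):
    """Return (noisy_input, clean_target) pairs for a text corrector."""
    rng = random.Random(seed)
    return [(corrupt(line, error_rate, rng), line) for line in lines]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["the quick brown fox"], error_rate=0.15):
        print(repr(noisy), "->", repr(clean))
```

Because the pairs are generated from text alone, no labeled images from the target domain are needed, which matches the annotation-free setting the abstract describes.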

Authors: Jingze Su, Tianle Zhu, Jiaxin Cai, Zhiyi Wang, Qi Li, Xiao Zhang, Tong Tong, Shu Wang, Wenxi Liu Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28027v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

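Rationale: The paper proposes a parameter-efficient fine-tuning framework (Cooperative Fine-Grained Refinement of SAM) for adapting the Segment Anything Model (SAM) to nuclei instance segmentation in medical images. Its core contribution is parameter-efficient fine-tuning (PEFT): components such as the Multi-scale Adaptive Local-aware Adapter augment the frozen SAM backbone with minimal parameters to transfer its capabilities. This is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points). The work is medical image analysis, which is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Because the paper transfers a pretrained model (SAM) to a downstream task, it is somewhat related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' and 'Post-training OR Supervised Fine-tuning OR SFT' (5 points each), though these are not its focus. The other keywords mainly concern LLM-specific techniques (reasoning, alignment, agents, and so on) and are unrelated to the paper's computer vision and medical image segmentation task (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a parameter-efficient fine-tuning framework that adapts the Segment Anything Model (SAM) to nuclei instance segmentation in medical images by enhancing local perception, preserving spatial details, and refining boundaries, which yields accurate segmentation.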

Abstract (Chinese translation)

细胞核实例分割在计算病理学中对癌症诊断与预后至关重要。近期，Segment Anything Model（SAM）凭借其在自然图像大规模预训练中获得的丰富先验知识与强大的全局上下文建模能力，已在多种分割任务中展现出卓越性能。然而，直接将SAM应用于医学影像领域存在显著局限：其缺乏对细胞核分割至关重要的局部结构特征的充分感知，且为下游任务进行全面微调需高昂计算成本。为将SAM的鲁棒先验知识高效迁移至细胞核实例分割任务，同时补充其任务感知的局部理解能力，我们提出一种参数高效的微调框架，命名为SAM的协同细粒度优化框架，该框架包含三个核心组件：1）多尺度自适应局部感知适配器，通过以极少的参数增强冻结的SAM主干网络，并利用动态生成的多尺度卷积核注入强大的局部结构感知能力，实现有效的知识迁移；2）层级调制融合模块，动态聚合多层级编码器特征以保留细粒度空间细节；3）边界引导掩码优化模块，通过显式监督将多上下文边界线索与语义特征相融合，生成聚焦边界的信号以优化初始掩码预测，实现更清晰的分割轮廓。这三个组件协同工作，共同增强局部感知、保留空间细节并优化边界，使SAM能够直接执行精确的细胞核实例分割。

Abstract

Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM’s robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.

Keywords: SAM, nuclei instance segmentation, parameter-efficient fine-tuning, medical imaging, computational pathology, local perception, mask refinement, domain adaptation

255. ❌ SegRGB-X: General RGB-X Semantic Segmentation Model

Authors: Jiong Liu, Yingjie Xu, Xingcheng Zhou, Rui Song, Walter Zimmer, Alois Knoll, Hu Cao Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

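Rationale: The paper studies multi-modal semantic segmentation and is unrelated to most of the large-model and deep learning keywords. It is somewhat related only to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5 points), because the abstract mentions fine-tuning MA-CLIP with LoRA. No other keyword is covered, so each of them scores 0. The weighted total is only 5.0 points, well below the passing score of 26.6, which indicates that the paper is not relevant to the core themes of large-model and deep learning technical innovation or applications in science.

!!! tip deepseek-chat TL;DR

The paper proposes SegRGB-X, a universal arbitrary-modal semantic segmentation framework. With a Modality-aware CLIP, modality-aligned embeddings, and a Domain-specific Refinement Module, it achieves state-of-the-art performance of 65.03% mIoU on datasets covering five complementary modalities.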

Abstract (Chinese translation)

由于不同传感器特性各异，跨任意传感器模态的语义分割面临显著挑战，而针对该任务的传统配置会导致冗余的开发工作。为解决这些问题，我们提出了一种通用的任意模态语义分割框架，以统一多模态分割任务。我们的方法包含三项关键创新：（1）模态感知CLIP（MA-CLIP），通过LoRA微调提供针对特定模态的场景理解指导；（2）模态对齐嵌入，用于捕捉细粒度特征；（3）动态特征调整的领域特定优化模块（DSRM）。在涵盖五种不同互补模态（事件、热成像、深度、偏振和光场）的多样化数据集上进行评估后，我们的模型超越了专用多模态方法，并以65.03%的平均交并比（mIoU）实现了最先进的性能。相关代码将在论文录用后公开。

Abstract

Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with a mIoU of 65.03%. The codes will be released upon acceptance.

Keywords: semantic segmentation, multi-modal, universal framework, LoRA fine-tuning, modality-aware CLIP, domain-specific refinement, arbitrary sensor modalities, state-of-the-art performance
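Since the SegRGB-X abstract relies on LoRA fine-tuning, a small dependency-free sketch of the LoRA forward pass may help. It shows the general low-rank update y = W x + (alpha / r) · B (A x) with the base weight W kept frozen; how MA-CLIP applies LoRA (which layers, which rank) is not stated in this report, so the shapes, the names `matvec` and `lora_forward`, and the example values are illustrative assumptions.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Forward pass of a linear layer with a LoRA update.

    W: frozen base weight, d_out x d_in
    A: trainable down-projection, r x d_in
    B: trainable up-projection, d_out x r (commonly initialized to zeros)
    Only A and B would be trained; the effective weight is W + (alpha / r) * B A.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]             # rank r = 1
    B_zero = [[0.0], [0.0]]      # zero init: output equals the frozen layer
    B_trained = [[1.0], [2.0]]   # hypothetical values after training
    x = [2.0, 3.0]
    print(lora_forward(W, A, B_zero, x, alpha=2.0))
    print(lora_forward(W, A, B_trained, x, alpha=2.0))
```

With a 2×2 base weight and rank 1, the adapter adds 4 trainable numbers instead of updating W directly; at realistic layer sizes the saving is far larger, which is the point of parameter-efficient fine-tuning.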

256. ❌ Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames

Authors: Hu Cao, Jiong Liu, Xingzhuo Yan, Rui Song, Yan Xia, Walter Zimmer, Guang Chen, Alois Knoll Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

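Scoring rationale: This paper studies steering prediction for autonomous driving using multimodal fusion of event and frame cameras, which places it in computer vision and robotics. The scoring keywords all concern large language models, LLM-centric deep learning techniques, and AI for Science, none of which the paper addresses, so every keyword receives a relevance of 0.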

!!! tip deepseek-chat TL;DR

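The paper proposes an energy-aware imitation learning framework that fuses event-camera and frame-camera data to improve steering prediction for autonomous driving, outperforming existing methods on two public datasets (DDD20 and DRFuser).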

摘要翻译

在自动驾驶领域，仅依赖基于帧的相机可能因长曝光时间、高速运动及复杂光照条件等因素导致感知不准确。为应对这些问题，我们引入一种受生物启发的视觉传感器——事件相机（event camera）。与传统相机不同，事件相机捕获稀疏、异步的事件流，为解决上述挑战提供了互补的感知模态。本研究提出一种面向转向预测的能量感知模仿学习框架，该框架同时利用事件流与帧数据。具体而言，我们设计了能量驱动的跨模态融合模块（Energy-driven Cross-modality Fusion Module, ECFM）以及能量感知解码器，以生成可靠且安全的预测。在DDD20和DRFuser两个公开真实世界数据集上的大量实验表明，本方法优于现有的先进（state-of-the-art, SOTA）方法。相关代码与训练模型将在论文录用后公开。

摘要 (Abstract)

In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The codes and trained models will be released upon acceptance.

Keywords: autonomous driving, steering prediction, event camera, imitation learning, cross-modality fusion, energy-aware, DDD20, DRFuser

257. ❌ Physically Inspired Gaussian Splatting for HDR Novel View Synthesis

Authors: Huimin Zeng, Yue Bai, Hailing Wang, Yun Fu Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28020v1

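Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0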
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

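Scoring rationale: This paper focuses on high dynamic range novel view synthesis (HDR-NVS) in computer vision and proposes a physically inspired Gaussian Splatting framework, PhysHDR-GS. Its core contribution is improved HDR scene reconstruction by modeling intrinsic reflectance and adjustable ambient illumination, together with a cross-branch HDR consistency loss and an illumination-guided gradient scaling strategy. Although the paper applies deep learning to computer vision, its content is unrelated to every scoring keyword, all of which center on large language models, model training, inference optimization, alignment, agent systems, and similar NLP or general-AI topics. It does not mention language models, MoE, scaling laws, LLM training methods, alignment techniques, inference acceleration, agent systems, or AI-for-science applications. All keywords therefore receive a relevance of 0.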

!!! tip deepseek-chat TL;DR

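The paper proposes a physically inspired Gaussian Splatting framework (PhysHDR-GS) that improves HDR novel view synthesis by modeling intrinsic reflectance and ambient illumination. It reconstructs HDR details better than existing methods (e.g., a PSNR gain of 2.04 dB over HDR-GS) while maintaining real-time rendering (up to 76 FPS).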

摘要翻译

高动态范围新视角合成（HDR-NVS）通过融合多曝光低动态范围（LDR）视图重建具有动态细节的场景，但难以捕捉依赖于环境光照的外观表现。通过约束色调映射结果对HDR内容进行隐式监督，无法纠正异常的HDR数值，并导致欠曝光/过曝光区域的高斯体梯度受限。为此，我们提出PhysHDR-GS——一个受物理学启发的HDR-NVS框架，通过本征反射率和可调环境光照对场景外观进行建模。PhysHDR-GS采用互补的图像曝光（IE）分支与高斯光照（GI）分支，分别忠实复现标准相机观测结果并捕捉光照依赖的外观变化。在训练过程中，所提出的跨分支HDR一致性损失为HDR内容提供显式监督，而光照引导的梯度缩放策略缓解了曝光偏差导致的梯度匮乏问题，并减少了欠致密化表示。在真实与合成数据集上的实验结果表明，本方法在重建HDR细节方面具有优越性（例如相较HDR-GS获得2.04 dB的PSNR提升），同时保持实时渲染速度（最高达76 FPS）。代码与模型已发布于https://huimin-zeng.github.io/PhysHDR-GS/。

摘要 (Abstract)

High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.

Keywords: High Dynamic Range Novel View Synthesis, Gaussian Splatting, Physically Inspired Modeling, Intrinsic Reflectance, Ambient Illumination, HDR Consistency Loss, Real-time Rendering, Multi-exposure Fusion

258. ❌ DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video

Authors: Jeonghaeng Lee, Seok Keun Choi, Zhixuan Li, Weisi Lin, Sanghoon Lee Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

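Scoring rationale: The paper "DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video" belongs to computer vision and graphics. It proposes a 3D Gaussian head avatar method that creates avatars with personalized details from monocular video. Its core contributions are a two-stage disentangled training pipeline and dynamic appearance fusion, which improve reconstruction fidelity and realism. The scoring keywords all concern large language models (LLMs), related deep learning techniques, or AI applications in science, whereas this paper addresses the specific vision problem of 3D head avatar generation and involves no large models or scientific applications such as biomedicine. All keywords therefore receive a relevance of 0.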

!!! tip deepseek-chat TL;DR

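The paper proposes DipGuava, a new 3D Gaussian head avatar method. A two-stage training pipeline disentangles facial appearance into a geometry-driven base appearance and personalized residual details, producing photorealistic, identity-preserving 3D avatars from monocular video that outperform existing methods in visual quality and quantitative performance.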

摘要翻译

尽管近期三维头部虚拟形象生成方法尝试模拟面部动态，其往往难以捕捉个性化细节，从而限制了真实感与表现力。为填补这一空白，我们提出DipGuava（解耦个性化高斯UV虚拟形象），这是一种新颖的三维高斯头部虚拟形象生成方法，能够从单目视频中成功生成具备个性化属性的虚拟形象。DipGuava是首个将面部外观显式解耦为两个互补组件的方法，通过结构化的两阶段训练流程显著降低学习歧义并提升重建保真度。在第一阶段，我们学习稳定的几何驱动基础外观，以捕捉全局面部结构及粗粒度表情依赖变化。在第二阶段，预测第一阶段未能捕获的个性化残差细节，包括高频成分及非线性变化特征（如皱纹与细微皮肤形变）。这些组件通过动态外观融合进行整合，该融合机制在形变后集成残差细节，确保空间与语义对齐。这种解耦设计使DipGuava能够生成具有照片级真实感且保持身份特征的虚拟形象，在视觉质量与量化性能上均持续超越现有方法，大量实验已验证其优越性。

摘要 (Abstract)

While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.

Keywords: 3D head avatar, Gaussian features, monocular video, disentangled appearance, personalized attributes, two-stage pipeline, dynamic appearance fusion, photorealistic avatars

259. ❌ Temporal Credit Is Free

Authors: Aur Shalev Merin Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

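Scoring rationale: The paper focuses on training algorithms for recurrent neural networks (RNNs): temporal credit is carried by the hidden state through the forward pass, and immediate derivatives with RMSprop replace full RTRL to reduce memory consumption. This has no direct connection to any of the LLM-related keywords (pre-training, fine-tuning, alignment, inference acceleration, hallucination mitigation, and so on) or to the AI-for-science keywords. The paper is about RNN training methods, not large-model techniques or their scientific applications.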

!!! tip deepseek-chat TL;DR

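The paper proposes an online adaptation method for recurrent neural networks that needs no Jacobian propagation: the hidden state carries temporal credit through the forward pass, and immediate derivatives with RMSprop match or exceed full RTRL while using about 1000x less memory.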

摘要翻译

循环网络无需通过雅可比传播实现在线适应。其隐态在前向传播过程中已承载时序信用分配信息；只要停止使用过时的迹记忆干扰梯度，并对不同参数组的梯度尺度进行归一化，即时导数便已足够。一项架构规则可预测何时需要归一化：当梯度必须通过无输出旁路的非线性状态更新时需引入$\beta_2$参数，反之则无需。在十种网络架构、真实灵长类神经元数据及流式机器学习基准测试中，采用RMSprop的即时导数方法在内存消耗降低1000倍的情况下（可扩展至n=1024单元），其性能达到或超越了完整实时循环学习算法。

摘要 (Abstract)

Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: $\beta_2$ is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.

Keywords: Recurrent networks, Temporal credit, Jacobian propagation, RTRL, Immediate derivatives, RMSprop, Memory efficiency, Online adaptation

260. ❌ CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

Authors: Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27999v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	2.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	2.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	2.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

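Scoring rationale: The paper studies fine-grained video emotion recognition built on the CLIP vision-language model, using Action Units (AUs) as structured text prompts to model facial expressions. Relevance to most keywords is low because the paper centers on vision-language models (VLMs) rather than LLM techniques. Relevant keywords: (1) 'Large Language Models' scores 3 because CLIP is a vision-language model and counts as a foundation model; (2) 'Pre-training' scores 2 for its use of CLIP's pretrained representations; (3) 'PEFT' scores 2 because the method avoids fine-tuning CLIP and uses lightweight prompt tuning; (4) 'Explainable AI' scores 2 because AUs provide interpretable semantics; (5) 'AI for Science' scores 5 as a bioinformatics application (emotion recognition datasets such as BioVid). Other keywords such as MoE, SFT, and RAG are unrelated to the paper.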

!!! tip deepseek-chat TL;DR

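The paper proposes CLIP-AUTT, which uses Action Units as structured text prompts in the CLIP vision-language model for fine-grained video emotion recognition. Test-time personalization improves recognition for unseen subjects, and the method outperforms existing approaches on several challenging datasets.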

摘要翻译

情感识别中的个性化对于准确解读细微且具有主体特异性的表达模式至关重要。视觉语言模型（如CLIP）的最新进展展现了利用联合图像-文本表征进行情感识别的强大潜力。然而，基于CLIP的方法要么依赖CLIP的对比预训练，要么依赖大语言模型生成描述性文本提示，这些方法存在噪声大、计算成本高且无法捕捉细粒度表情的问题，导致性能下降。在本研究中，我们利用动作单元作为CLIP中的结构化文本提示来建模细粒度面部表情。动作单元编码了表情背后细微的肌肉激活，为更鲁棒的情感识别提供了局部化且可解释的语义线索。我们提出了CLIP-AU，一种轻量级的、由动作单元引导的时间学习方法，它将可解释的动作单元语义整合到CLIP中。该方法通过对齐动作单元提示与面部动态来学习通用的、与主体无关的表征，从而无需微调CLIP或使用大语言模型生成的文本监督即可实现细粒度情感识别。尽管CLIP-AU建模了细粒度的动作单元语义，但它未能适应细微表情中主体特异性的变化。为了解决这一局限，我们提出了CLIP-AUTT，一种基于视频的测试时个性化方法，它能动态地将动作单元提示适配到未见主体的视频中。通过将熵引导的时间窗口选择与提示调优相结合，CLIP-AUTT在保持时间一致性的同时实现了主体特异性适配。我们在三个具有挑战性的基于视频的细微情感识别数据集（BioVid、StressID和BAH）上进行的广泛实验表明，CLIP-AU和CLIP-AUTT优于当前最先进的基于CLIP的面部表情识别和测试时适应方法，实现了鲁棒且个性化的细微情感识别。

摘要 (Abstract)

Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP’s contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.

Keywords: CLIP, Action Units, video emotion recognition, test-time personalization, vision-language models, fine-grained facial expressions, prompt tuning, temporal learning
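
The abstract above mentions entropy-guided temporal window selection but does not spell out the rule. As a minimal sketch, assuming that "entropy-guided" means keeping the temporal windows whose class predictions have the lowest Shannon entropy (the most confident ones), the selection step might look like the following. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_windows(window_probs, keep):
    """Return the indices of the `keep` lowest-entropy windows, in temporal order.

    Assumption: low prediction entropy marks a window as reliable for adaptation.
    """
    ranked = sorted(range(len(window_probs)), key=lambda i: entropy(window_probs[i]))
    return sorted(ranked[:keep])


# Toy per-window class probabilities for a two-class problem (hypothetical values).
window_probs = [
    [0.50, 0.50],  # maximally uncertain
    [0.95, 0.05],  # confident
    [0.70, 0.30],
    [0.10, 0.90],  # confident
]
selected = select_confident_windows(window_probs, keep=2)
```

Here `selected` holds the two most confident windows; in a test-time adaptation loop, only those windows would contribute to the prompt-tuning update.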

261. ❌ Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Authors: Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

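Scoring rationale: The paper studies why sparse autoencoders (SAEs) fail at compositional generalisation. It is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10) because SAEs are a concrete instance of sparse models, and to 'Mechanistic Interpretability OR Explainable AI' (10) because it examines the mechanisms by which neural network representations can be interpreted. The remaining keywords are unrelated (0): the paper does not cover LLM applications, training techniques, inference optimization, alignment, agent systems, or similar topics.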

!!! tip deepseek-chat TL;DR

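The paper finds that the root cause of sparse autoencoders failing at compositional generalisation is dictionary learning rather than amortised inference: replacing the encoder with per-sample FISTA on the same learned dictionary does not close the gap, while an oracle baseline with a good dictionary solves the problem at all scales tested.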

摘要翻译

线性表征假说认为，神经网络激活通过线性混合的方式编码高级概念。然而，在叠加态下，这种编码是从高维概念空间到低维激活空间的投影，而概念空间中的线性决策边界在投影后未必保持线性。在此背景下，传统的稀疏编码方法通过逐样本迭代推理，利用压缩感知的保证来恢复潜在因子。相比之下，稀疏自编码器将稀疏推理摊销为一个固定编码器，从而引入了系统性差距。我们证明，这种摊销差距在不同训练集规模、潜在维度和稀疏度水平下持续存在，导致稀疏自编码器在分布外组合偏移场景中失效。通过分解失效机制的受控实验，我们发现词典学习——而非推理过程——是根本性约束：稀疏自编码器学得的词典指向存在显著偏差的方向，即使在同一词典上用逐样本FISTA算法替换编码器也无法弥补差距。一个预言机基线证明，在所有测试规模下，只要拥有良好的词典，该问题均可解决。我们的研究结果将稀疏自编码器的失效重新界定为词典学习挑战，而非摊销问题，并指出可扩展的词典学习是叠加态下稀疏推理领域亟待解决的关键难题。

摘要 (Abstract)

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning – not the inference procedure – as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.

Keywords: sparse autoencoders, compositional generalisation, linear representation hypothesis, superposition, dictionary learning, amortisation gap, out-of-distribution, sparse coding
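
The contrast the abstract draws between per-sample iterative inference and a fixed encoder can be illustrated on a toy problem. The sketch below is not the paper's experimental setup: it assumes a hand-built, known-good 2x3 dictionary, runs standard FISTA on the lasso objective 0.5·||x − Dz||² + λ·||z||₁, and compares it with a hypothetical one-shot encoder (a soft-thresholded Dᵀx) that stands in for an amortised encoder and is not a trained SAE. All function names are illustrative.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |v|."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def matvec(D, z):
    """Compute D z, with D stored as a list of rows."""
    return [sum(d * zj for d, zj in zip(row, z)) for row in D]


def rmatvec(D, r):
    """Compute D^T r."""
    return [sum(D[i][j] * r[i] for i in range(len(D))) for j in range(len(D[0]))]


def fista(D, x, lam, n_iter=2000):
    """Per-sample sparse inference: minimise 0.5*||x - D z||^2 + lam*||z||_1."""
    # ||D||_F^2 upper-bounds the largest eigenvalue of D^T D, so 1/L is a safe step size.
    L = sum(d * d for row in D for d in row)
    z = [0.0] * len(D[0])
    y, t = z[:], 1.0
    for _ in range(n_iter):
        grad = rmatvec(D, [a - b for a, b in zip(matvec(D, y), x)])
        z_next = [soft_threshold(yj - gj / L, lam / L) for yj, gj in zip(y, grad)]
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum (extrapolation) step that distinguishes FISTA from plain ISTA.
        y = [zn + ((t - 1.0) / t_next) * (zn - zo) for zn, zo in zip(z_next, z)]
        z, t = z_next, t_next
    return z


def one_shot_encoder(D, x, lam):
    """Hypothetical fixed encoder: a single soft-thresholded correlation with each atom."""
    return [soft_threshold(c, lam) for c in rmatvec(D, x)]


s = 1.0 / math.sqrt(2.0)
D = [[1.0, 0.0, s],
     [0.0, 1.0, s]]                # three unit-norm atoms in a 2-D activation space
x = matvec(D, [0.0, 0.0, 1.0])    # observation generated by the third atom only
lam = 0.01

z_fista = fista(D, x, lam)
z_one_shot = one_shot_encoder(D, x, lam)
```

With this dictionary, the iterative solver concentrates the code on the third atom, while the one-shot encoder also activates the first two atoms because they correlate with `x`. The example only illustrates the inference side; the paper's finding is that the learned dictionary, not the inference procedure, is the binding constraint.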

262. ❌ Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

Authors: Meitong Liu, Christopher Jung, Rui Li, Xue Feng, Han Zhao Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

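Scoring rationale: The paper provides a theoretical analysis of transfer learning in linear regression and linear neural networks, including closed-form expressions for the expected generalization error, a bias-variance decomposition, and task-weight optimization. The scoring keywords all relate to large models, related deep learning techniques, or specific application areas, whereas this paper is a theoretical analysis of classical linear models and involves no large-model techniques or AI-for-science applications. All keywords therefore receive a relevance of 0.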

!!! tip deepseek-chat TL;DR

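The paper studies the theoretical conditions for transfer learning in linear regression and linear neural networks. It derives closed-form expressions for the expected generalization error in linear regression, a non-asymptotic bound for linear networks, and a method for optimizing task weights, giving theoretical guarantees for when auxiliary data improves generalization on the main task.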

摘要翻译

在迁移学习中，学习者利用辅助数据提升在主任务上的泛化性能。然而，关于辅助数据何时以及如何改善泛化性能的精确理论理解仍不完善。本文针对两种经典线性设定——普通最小二乘回归与欠参数化线性神经网络——就该问题提供了新的理论洞见。对于线性回归，我们通过偏差-方差分解推导了期望泛化误差的精确闭式表达式，从而得到辅助任务改善主任务泛化性能的充分必要条件。同时，我们推导出可通过可求解优化程序输出的全局最优任务权重，并为其经验估计提供了理论一致性保证。对于具有共享表征宽度 $q \leq K$（其中 $K$ 为辅助任务数量）的线性神经网络，我们推导了泛化误差的非渐近期望界，首次在该设定下建立了非平凡的辅助学习有益性充分条件，并为任务权重优化提供了理论指导。这一成果的取得源于我们证明了一个新的随机矩阵列方向低秩扰动界，该结论通过保持细粒度的列结构改进了现有理论界。我们在参数受控的合成数据实验中验证了所有理论结果。

摘要 (Abstract)

In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width $q \leq K$, where $K$ is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.

Keywords: transfer learning, linear regression, linear neural networks, generalization error, bias-variance decomposition, task weights, auxiliary data, theoretical analysis

263. ❌ Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Authors: Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

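Scoring rationale: The paper's core subject is scaling laws for large language models (LLMs) and Mixture-of-Experts (MoE) models. It proposes a new parameterization framework, HyperP, which under the Frobenius-sphere constraint transfers optimal learning rates across model width, depth, training tokens, and MoE granularity, and substantially improves compute efficiency and training stability. It is therefore highly relevant to 'Large Language Models', 'Mixture of Experts', and 'Scaling Laws' (10 each). Its optimizer parameterization applies to the pre-training stage, giving some relevance to 'Pre-training' (5). The paper does not touch the other keywords, such as small models, post-training, alignment, inference acceleration, or AI for Science, so they score 0.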

!!! tip deepseek-chat TL;DR

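The paper studies how, in scaling large language models, the Hypersphere Parameterization framework (HyperP) and the SqrtGate gating mechanism enable optimal learning rate transfer across model scales (including MoE granularity) and stable training at scale, substantially improving compute efficiency.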

摘要翻译

大语言模型的缩放定律关键取决于优化器与参数化方式。现有的超参数迁移定律主要针对一阶优化器开发，且其结构上无法防止大规模训练中的不稳定性。近期提出的超球面优化方法将权重矩阵约束在固定范数的超球面上，为更稳定的缩放提供了有前景的替代方案。我们引入HyperP（超球面参数化），这是首个在Frobenius球面约束下、使用Muon优化器，实现跨模型宽度、深度、训练词元量以及混合专家（Mixture-of-Experts, MoE）粒度迁移最优学习率的框架。我们证明了在Frobenius球面上权重衰减是一阶无效操作，指出Depth-μP仍然必要，并发现最优学习率遵循与先前在AdamW中观察到的、具有“神奇指数”0.32相同的数据缩放幂律。在HyperP下，于最小规模调优的单个基础学习率可迁移至所有计算规模，在6×10²¹ FLOPs的计算量下，相比强大的Muon基线实现了1.58倍的计算效率提升。此外，HyperP提供了可迁移的稳定性：所有监测的不稳定性指标，包括Z值、输出RMS和激活异常值，在训练FLOPs缩放过程中均保持有界且非递增。我们还提出了SqrtGate，一种源自超球面约束的MoE门控机制，它能在不同MoE粒度间保持输出RMS，以改进粒度缩放；并证明超球面优化允许使用显著更大的辅助负载均衡权重，从而同时实现强劲的性能和良好的专家平衡。我们在https://github.com/microsoft/ArchScale 发布了训练代码库。

摘要 (Abstract)

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$μ$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the “magic exponent” 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Keywords: Large Language Models, Scaling Laws, Mixture of Experts, Hypersphere Optimization, Learning Rate Transfer, Training Stability, Parameterization, Compute Efficiency
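The abstract's claim that weight decay is a first-order no-op on the Frobenius sphere can be checked numerically with a small sketch. The code below is an illustration under stated assumptions, not the paper's method: it uses a plain gradient step rather than Muon, and `project_to_sphere`, `sphere_step`, and `decay_effect` are hypothetical helper names. If decay only matters at second order, shrinking the learning rate by 10x should shrink the gap between the decayed and undecayed projected steps by roughly 100x.

```python
import numpy as np

def project_to_sphere(W, radius):
    """Rescale W so that its Frobenius norm equals `radius`."""
    return W * (radius / np.linalg.norm(W))

def sphere_step(W, G, lr, weight_decay, radius):
    """One gradient step with weight decay, followed by re-projection."""
    W_new = W - lr * (G + weight_decay * W)
    return project_to_sphere(W_new, radius)

def decay_effect(lr, weight_decay=0.1, radius=1.0, seed=0):
    """Frobenius distance between the projected steps with and without decay."""
    rng = np.random.default_rng(seed)
    W = project_to_sphere(rng.standard_normal((8, 8)), radius)
    G = rng.standard_normal((8, 8))  # stand-in gradient (assumption)
    with_decay = sphere_step(W, G, lr, weight_decay, radius)
    without_decay = sphere_step(W, G, lr, 0.0, radius)
    return np.linalg.norm(with_decay - without_decay)

if __name__ == "__main__":
    for lr in (1e-2, 1e-3, 1e-4):
        print(f"lr={lr:g}  decay effect={decay_effect(lr):.3e}")
```

The reason is visible in the algebra: `W - lr * (G + wd * W)` equals `(1 - lr * wd) * W - lr * G`, and after renormalization the scalar factor on `W` only rescales the effective step size by `1 / (1 - lr * wd)`, a correction of order `lr**2`.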

264. ❌ See it to Place it: Evolving Macro Placements with Vision-Language Models

Authors: Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

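Scoring rationale: The paper proposes using vision-language models (VLMs) for chip floorplanning, an application of large models to electronic design automation (EDA). It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8/10) because VLMs are a kind of foundation model and the abstract explicitly refers to tools that "leverage foundation models". It is rated highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8/10) on the grounds that chip design belongs to scientific computing and engineering, making this an application of AI for Science in electronic engineering. The remaining keywords concern specific large-model techniques (MoE, quantization, alignment, and so on) or specific application domains (such as bioinformatics) that the paper does not touch, so they score 0.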

!!! tip deepseek-chat TL;DR

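The study proposes VeoPlace, a framework that uses a vision-language model (VLM) to guide macro placement in chip floorplanning. On open-source benchmarks it outperforms the best prior learning-based approach on 9 of 10 benchmarks, and its gains carry over to analytical placers.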

Abstract (Chinese translation)

我们提出利用视觉语言模型进行芯片布局规划中的宏单元摆放，这是一项复杂的优化任务，近期通过机器学习方法已展现出显著进展。由于人类设计师高度依赖空间推理能力在芯片画布上排布元件，我们假设具有强大视觉推理能力的视觉语言模型可以有效补充现有基于学习的方法。我们引入了VeoPlace（视觉进化优化布局）这一新颖框架，该框架使用未经微调的视觉语言模型，通过将基础布局器的操作限制在芯片画布的子区域内来指导其行为。视觉语言模型生成的布局方案会基于最终布局质量，通过进化搜索策略进行迭代优化。在开源基准测试中，VeoPlace在10个基准中的9个上超越了先前最佳的基于学习的方法，峰值线长减少超过32%。我们进一步证明VeoPlace可推广至解析布局器，在全部8个评估基准上提升了DREAMPlace的性能，最高增益达4.3%。我们的方法为电子设计自动化工具开辟了新路径，使其能够利用基础模型解决复杂的物理设计问题。

Abstract

We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.

Keywords: Vision-Language Models, chip floorplanning, macro placement, evolutionary optimization, electronic design automation, physical design, wirelength reduction, foundation models
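The loop described in the VeoPlace abstract (region proposals constrain a base placer, and an evolutionary search keeps the proposals that yield better placements) can be sketched on a toy problem. Everything below is a stand-in: `propose_region` draws a random square where the paper would query a VLM, `base_placer` places macros uniformly at random inside their regions, and quality is measured with half-perimeter wirelength (HPWL), a common placement proxy that the abstract does not name explicitly.

```python
import random

def hpwl(positions, nets):
    """Half-perimeter wirelength: sum of bounding-box half-perimeters over nets."""
    total = 0.0
    for net in nets:
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def base_placer(regions, rng):
    """Stand-in base placer: drop each macro uniformly inside its region."""
    return [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for (x0, y0, x1, y1) in regions]

def propose_region(rng, canvas=1.0, size=0.3):
    """Stand-in for a VLM proposal: a random square subregion of the canvas."""
    x0 = rng.uniform(0.0, canvas - size)
    y0 = rng.uniform(0.0, canvas - size)
    return (x0, y0, x0 + size, y0 + size)

def evolve(num_macros, nets, generations=30, population=8, seed=0):
    """Evolve per-macro region assignments; returns ((score, regions), history)."""
    rng = random.Random(seed)

    def evaluate(regions):
        return hpwl(base_placer(regions, rng), nets)

    pop = [[propose_region(rng) for _ in range(num_macros)] for _ in range(population)]
    scored = sorted(((evaluate(c), c) for c in pop), key=lambda t: t[0])
    history = [scored[0][0]]
    for _ in range(generations):
        survivors = scored[: population // 2]  # elitism: keep the better half
        children = []
        for _, parent in survivors:
            child = list(parent)
            child[rng.randrange(num_macros)] = propose_region(rng)  # mutate one macro
            children.append((evaluate(child), child))
        scored = sorted(survivors + children, key=lambda t: t[0])
        history.append(scored[0][0])
    return scored[0], history

if __name__ == "__main__":
    (best_score, _), history = evolve(4, [[0, 1], [1, 2, 3]])
    print(f"first best={history[0]:.3f}  final best={best_score:.3f}")
```

Because survivors keep their recorded scores, the best score in `history` never increases from one generation to the next.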

265. ❌ Functional Natural Policy Gradients

Authors: Aurelien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

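Scoring rationale: The paper studies policy learning from offline data and proposes a cross-fitted debiasing method. It belongs to reinforcement learning and statistical learning theory and is unrelated to every keyword, all of which focus on large models, deep learning techniques, and their applications.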

!!! tip deepseek-chat TL;DR

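The paper proposes a cross-fitted debiasing device for policy learning from offline data. It achieves $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$.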

Abstract (Chinese translation)

我们提出一种交叉拟合去偏装置，用于从离线数据中进行策略学习。该学习原理的一个关键结果是，即使对于复杂度超过Donsker类的策略类别，只要误差乘积形式的干扰项余项为$O(N^{-1/2})$，仍能实现$\sqrt N$级别的遗憾界。该遗憾界可分解为两个因子：一个是由策略类别复杂度决定的插件策略误差因子，另一个是由环境动态复杂度决定的环境干扰因子，从而明确揭示了两者之间如何相互权衡。

Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.

Keywords: policy learning, offline data, debiasing, regret bound, cross-fitted, nuisance remainder, policy-class complexity, environment dynamics
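To make "cross-fitted debiasing" concrete, here is a minimal sketch of a standard cross-fitted doubly robust (AIPW) estimate of a policy's value from logged bandit data. It is a textbook construction chosen for illustration and may differ from the device in the paper. The data generator, the linear outcome models, the known propensities, and all function names are assumptions.

```python
import numpy as np

def make_toy_data(n=4000, seed=0):
    """Synthetic logged data: uniform logging policy, reward = +/- X[:, 0] + noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 2))
    A = rng.integers(0, 2, size=n)
    propensity = np.full(n, 0.5)  # P(logged action | context), assumed known
    Y = X[:, 0] * (2 * A - 1) + 0.5 * rng.standard_normal(n)
    return X, A, Y, propensity

def target_policy(X):
    """Policy to evaluate: play action 1 when the first feature is positive."""
    return (X[:, 0] > 0).astype(int)

def cross_fitted_dr_value(X, A, Y, propensity, policy, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of the value of `policy`."""
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    Xb = np.column_stack([np.ones(n), X])  # add an intercept column
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        m = int(test.sum())
        # Nuisance step: fit one linear outcome model per action on the other folds.
        q = np.empty((m, 2))
        for a in (0, 1):
            mask = train & (A == a)
            coef, *_ = np.linalg.lstsq(Xb[mask], Y[mask], rcond=None)
            q[:, a] = Xb[test] @ coef
        pi = policy(X[test])
        rows = np.arange(m)
        match = (A[test] == pi).astype(float)
        # Plug-in term plus an importance-weighted residual correction.
        scores[test] = q[rows, pi] + match / propensity[test] * (Y[test] - q[rows, A[test]])
    return scores.mean()

if __name__ == "__main__":
    X, A, Y, p = make_toy_data()
    print("DR estimate:", cross_fitted_dr_value(X, A, Y, p, target_policy))
    print("true value :", np.sqrt(2 / np.pi))  # E|X_0| for a standard normal
```

Fitting the outcome models on folds that exclude the evaluation points is what keeps the nuisance estimation error from correlating with the residual term.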

266. ❌ GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Authors: Soutrik Mukherjee, Sangwhan Cha Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28708v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

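Scoring rationale: The paper focuses on GPU-accelerated inference for Transformer models using TensorRT and mixed precision. It is highly relevant to "Speculative Decoding OR Inference Acceleration" (10/10) because inference acceleration is its core topic; relevant to "Quantization OR Model Compression OR Low-bit Weights" (8/10) because it uses FP16 mixed precision to reduce memory usage; and somewhat relevant to "Large Language Models OR LLMs OR Foundation Models" (5/10) because it evaluates GPT-2, a foundation model. The other keywords (MoE, SFT, RAG, and so on) are not covered and score 0.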

!!! tip deepseek-chat TL;DR

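The paper studies accelerating real-time inference of Transformer models such as BERT and GPT-2 with NVIDIA TensorRT and mixed-precision optimization. It reports up to 64.4x speedup over CPU baselines, sub-10 ms single-sample latency, and a 63% reduction in memory usage while preserving high numerical fidelity.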

Abstract (Chinese translation)

本文提出了一种基于NVIDIA TensorRT混合精度优化的GPU加速Transformer模型推理流水线设计与评估方案。我们评估了BERT-base（1.1亿参数）和GPT-2（1.24亿参数）模型在批处理规模1至32、序列长度32至512范围内的性能。该系统相比CPU基线实现了最高64.4倍的加速，单样本推理延迟低于10毫秒，内存使用量降低63%。我们提出了一种混合精度策略：对softmax和层归一化等数值敏感操作保留FP32精度，而对线性层应用FP16精度。该方法保持了较高的数值保真度（与基线输出的余弦相似度≥0.9998），并消除了NaN不稳定性。该流水线采用模块化容器化实现，支持超过360种配置的可复现基准测试。在NVIDIA A100上的跨GPU验证显示，FP16加速比稳定在1.84倍至2.00倍之间，且数值行为稳定。在SST-2数据集的下游任务评估中，混合精度未导致准确率下降。在WikiText-2上的验证表明，随机输入会低估全FP16精度下NaN不稳定性达6倍，同时证实了混合方法的鲁棒性（NaN出现率0.0%，余弦相似度≥0.9998）。这些结果详细刻画了不同GPU架构下性能与精度的权衡关系，为在延迟敏感环境中部署Transformer模型提供了实用指导。

Abstract

This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.

Keywords: GPU-accelerated inference, Transformer models, NVIDIA TensorRT, mixed-precision optimization, real-time inference, latency reduction, memory usage reduction, numerical fidelity
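The hybrid precision strategy in the abstract (FP16 for linear layers, FP32 for softmax and layer normalization) can be imitated on the CPU with NumPy to show how the cosine-similarity fidelity check works. This is a sketch only: it does not use TensorRT, the single-head attention block and random weights are assumptions, and NumPy's FP16 arithmetic is not identical to GPU tensor-core behavior.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis (kept in FP32)."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or shift (kept in FP32)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_block(x, Wq, Wk, Wv, linear_dtype):
    """Single-head attention with linear layers computed in `linear_dtype`."""
    def linear(inp, W):
        out = inp.astype(linear_dtype) @ W.astype(linear_dtype)
        return out.astype(np.float32)  # hand results back in FP32

    h = layer_norm(x.astype(np.float32))
    q, k, v = linear(h, Wq), linear(h, Wk), linear(h, Wv)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cosine_similarity(a, b):
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_precisions(seq_len=16, dim=64, seed=0):
    """Cosine similarity between the FP32 baseline and the hybrid FP16/FP32 output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, dim)).astype(np.float32)
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)).astype(np.float32) / np.sqrt(dim)
                  for _ in range(3))
    baseline = attention_block(x, Wq, Wk, Wv, np.float32)
    hybrid = attention_block(x, Wq, Wk, Wv, np.float16)
    return cosine_similarity(baseline, hybrid), bool(np.isnan(hybrid).any())

if __name__ == "__main__":
    cos, has_nan = compare_precisions()
    print(f"cosine similarity={cos:.6f}  NaN in hybrid output={has_nan}")
```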

267. ❌ FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning

Authors: Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Azzam Mourad, Hadi Otrok, Jamal Bentahar Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

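Scoring rationale: The paper focuses on defending against backdoor attacks in federated learning and is unrelated to most keywords. It has only a weak connection to "Pre-training" (it describes a defense applied at the pre-training stage), and this is not its core content. The paper involves no large models, no innovation in deep learning fundamentals, and no scientific-domain application, so it does not fit the research focus.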

!!! tip deepseek-chat TL;DR

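The paper proposes FL-PBM, a defense against backdoor attacks that operates before training in federated learning. Using data filtering and sanitization, it reduces attack success rates by up to 95% while maintaining model accuracy.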

Abstract (Chinese translation)

后门攻击对人工智能（AI）模型的完整性和可靠性构成重大威胁，攻击者可通过注入带有隐藏触发器的污染数据来操纵模型行为。此类攻击可能导致严重后果，尤其在自动驾驶、医疗保健和金融等关键应用中。在模型生命周期的各个阶段（包括预训练、训练中和训练后）检测并缓解后门攻击至关重要。本文提出面向联邦学习（Federated Learning, FL）的预训练后门缓解方法（FL-PBM），这是一种在联邦学习环境中于模型训练前主动在客户端过滤污染数据的新型防御机制。该方法包含四个阶段：（1）向数据中插入良性触发器以建立受控基线；（2）应用主成分分析（Principal Component Analysis, PCA）提取判别性特征并评估数据的可分离性；（3）执行高斯混合模型（Gaussian Mixture Model, GMM）聚类，根据数据在PCA转换空间中的分布识别潜在恶意样本；（4）应用定向模糊技术以破坏潜在的后门触发器。这些步骤共同确保可疑数据被早期检测并有效净化，从而最小化后门触发器对全局模型的影响。在图像数据集上的实验评估表明，相较于基线联邦学习（FedAvg），FL-PBM将攻击成功率降低了高达95%；相较于前沿防御方法（RDFL和LPSF），攻击成功率降低了30%至80%。同时，在大多数实验中该方法保持了超过90%的干净模型准确率，在实现更优缓解效果的同时未降低模型性能。

Abstract

Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to manipulate model behavior by injecting poisoned data with hidden triggers. These attacks can lead to severe consequences, especially in critical applications such as autonomous driving, healthcare, and finance. Detecting and mitigating backdoor attacks is crucial across all phases of a model's lifecycle, including pre-training, in-training, and post-training. In this paper, we propose Pre-Training Backdoor Mitigation for Federated Learning (FL-PBM), a novel defense mechanism that proactively filters poisoned data on the client side before model training in a federated learning (FL) environment. The approach consists of four stages: (1) inserting a benign trigger into the data to establish a controlled baseline, (2) applying Principal Component Analysis (PCA) to extract discriminative features and assess the separability of the data, (3) performing Gaussian Mixture Model (GMM) clustering to identify potentially malicious data samples based on their distribution in the PCA-transformed space, and (4) applying a targeted blurring technique to disrupt potential backdoor triggers. Together, these steps ensure that suspicious data is detected early and sanitized effectively, thereby minimizing the influence of backdoor triggers on the global model. Experimental evaluations on image-based datasets demonstrate that FL-PBM reduces attack success rates by up to 95% compared to baseline federated learning (FedAvg) and by 30 to 80% relative to state-of-the-art defenses (RDFL and LPSF). At the same time, it maintains over 90% clean model accuracy in most experiments, achieving better mitigation without degrading model performance.

Keywords: Federated Learning, Backdoor Attacks, Pre-training Defense, Data Poisoning, Gaussian Mixture Model, Principal Component Analysis, Model Security, Attack Mitigation
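Stages (2) to (4) of the FL-PBM pipeline (PCA projection, GMM clustering, blurring of flagged samples) can be sketched with NumPy alone. This is a rough illustration under assumptions that the abstract does not confirm: the benign-trigger stage is omitted, only the first principal component is used, the mixture has two components, the smaller cluster is treated as suspicious, and a 3x3 box blur stands in for the paper's targeted blurring. All function names are hypothetical.

```python
import numpy as np

def pca_first_component(X):
    """Project the rows of X onto the first principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def gmm_1d_two_components(z, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns responsibilities (n, 2)."""
    mu = np.array([z.min(), z.max()], dtype=float)
    var = np.full(2, z.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        log_p = (-0.5 * (z[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * z[:, None]).sum(axis=0) / nk
        var = (resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        w = nk / len(z)
    return resp

def box_blur(img):
    """3x3 mean filter with edge padding."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def sanitize(images):
    """Flag the minority cluster as suspicious and blur those images."""
    X = images.reshape(len(images), -1)
    labels = gmm_1d_two_components(pca_first_component(X)).argmax(axis=1)
    minority = np.argmin(np.bincount(labels, minlength=2))
    suspicious = labels == minority
    cleaned = images.copy()
    for i in np.flatnonzero(suspicious):
        cleaned[i] = box_blur(images[i])
    return cleaned, suspicious

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = 0.5 + 0.05 * rng.standard_normal((100, 8, 8))
    images[:10, :3, :3] += 0.5  # toy "trigger" patch on the first ten images
    _, suspicious = sanitize(images)
    print("flagged indices:", np.flatnonzero(suspicious))
```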

268. ❌ Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory

Authors: Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Anderson Avila, Azzam Mourad, Hadi Otrok Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

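Scoring rationale: The paper focuses on defending against backdoor attacks in federated learning using a reputation system, an incentive mechanism, and game-theoretic methods. It does not involve large models, deep learning fundamentals, or AI-for-science applications. All keywords relate to large-model techniques, training methods, inference optimization, AI agents, or AI for science, whereas this work addresses conventional federated learning security, so it is unrelated to all of them.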

!!! tip deepseek-chat TL;DR

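The paper proposes FedBBA, which defends against backdoor attacks in federated learning through a reputation system, an incentive mechanism, and a game-theoretic model. On traffic-sign datasets it reduces the attack success rate to approximately 1.1%–11% while maintaining high normal-task accuracy.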

Abstract (Chinese translation)

联邦学习（Federated Learning, FL）因其能够利用大量分散数据的同时保护隐私，正获得日益广泛的应用。然而，尽管具备这些优势，联邦学习仍存在若干缺陷，直接影响其生成的全局模型的准确性与完整性。其中一个缺陷是恶意客户端的存在，这些客户端试图通过向本地模型中注入后门数据来破坏全局模型，同时规避检测。此类客户端的目的是诱导全局模型在推理过程中做出错误预测，从而损害诚实参与者所依赖的全局模型的完整性和可信度。为抑制此类恶意行为，我们提出了FedBBA（联邦后门与行为分析模型）。该模型旨在削弱此类客户端对最终准确性的影响，构建更具韧性的联邦学习环境。我们通过结合以下三种机制设计方法：（1）用于评估和追踪客户端行为的信誉系统；（2）奖励诚实参与并惩罚恶意行为的激励机制；（3）结合投影寻踪分析（Projection Pursuit Analysis, PPA）的博弈论模型，以动态识别并最小化恶意客户端对全局模型的影响。在德国交通标志识别基准（GTSRB）和比利时交通标志分类（BTSC）数据集上的大量仿真实验表明，FedBBA在各种攻击场景下将后门攻击成功率降低至约1.1%–11%，显著优于RDFL和RoPE等先进防御方案（其攻击成功率介于23%至76%之间），同时保持了较高的正常任务准确率（约95%–98%）。

Abstract

Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy, and the integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such mischievous behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%–11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%–98%).

Keywords: Federated Learning, Backdoor Attacks, Game Theory, Reputation System, Incentive Mechanism, Projection Pursuit Analysis, Traffic Sign Recognition, Model Integrity

269. ❌ LACE: Loss-Adaptive Capacity Expansion for Continual Learning

Authors: Shivnath Tathe Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

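Scoring rationale: The paper proposes LACE, a loss-adaptive capacity expansion method for continual learning, and mainly concerns continual learning, model capacity expansion, and online training mechanisms. It is unrelated to most keywords. It is somewhat relevant only to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5/10), because it involves continual learning and domain adaptation concepts, though not as large-model-specific techniques. The other keywords target specific directions such as large-model techniques, alignment, reasoning, and agents, none of which the paper addresses.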

!!! tip deepseek-chat TL;DR

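The paper proposes LACE, a loss-signal-driven adaptive capacity expansion method that addresses the inability of fixed-capacity models to adapt to new data domains in continual learning. Experiments show that it triggers expansions exclusively at domain boundaries and matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions.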

Abstract (Chinese translation)

固定表征容量是持续学习中的一个基本限制：实践者必须在训练前猜测合适的模型宽度，而无法预知数据包含多少独立概念。我们提出LACE（损失自适应容量扩展），这是一种简单的在线机制，通过监控模型自身的损失信号在训练过程中动态扩展模型的表征能力。当持续损失偏差超过阈值——表明当前容量不足以处理新遇到的数据时——LACE会向投影层添加新的维度，并将这些新维度与现有参数进行联合训练。在合成数据与真实数据的实验中，LACE仅在领域边界处触发扩展（边界精确度100%，零误报），以初始维度仅为大型固定容量模型一小部分的配置达到了同等精度，且生成的适配器维度对整体性能具有关键作用（移除所有适配器会导致准确率下降3%）。我们进一步通过分层聚类展示了GPT-2激活中的无监督领域分离现象，揭示了跨网络层的U形可分离性曲线，这为深度网络中的自适应容量分配提供了理论依据。LACE无需标签、无需回放缓冲区、无需外部控制器，使其特别适合资源受限环境下的设备端持续学习。

Abstract

Fixed representational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width before training, without knowing how many distinct concepts the data contains. We propose LACE (Loss-Adaptive Capacity Expansion), a simple online mechanism that expands a model’s representational capacity during training by monitoring its own loss signal. When sustained loss deviation exceeds a threshold - indicating that the current capacity is insufficient for newly encountered data - LACE adds new dimensions to the projection layer and trains them jointly with existing parameters. Across synthetic and real-data experiments, LACE triggers expansions exclusively at domain boundaries (100% boundary precision, zero false positives), matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions, and produces adapter dimensions that are collectively critical to performance (3% accuracy drop when all adapters removed). We further demonstrate unsupervised domain separation in GPT-2 activations via layer-wise clustering, showing a U-shaped separability curve across layers that motivates adaptive capacity allocation in deep networks. LACE requires no labels, no replay buffers, and no external controllers, making it suitable for on-device continual learning under resource constraints.

Keywords: Continual Learning, Capacity Expansion, Loss-Adaptive, Online Mechanism, Domain Boundaries, Representational Capacity, Unsupervised Domain Separation, On-device Learning
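The trigger described in the LACE abstract, where expansion fires when a sustained loss deviation exceeds a threshold, might look like the sketch below. The abstract does not specify how the baseline or "sustained" is defined, so the exponential moving average baseline, the `patience` counter, and all parameter values are assumptions, and the actual step of adding dimensions to the projection layer is left out.

```python
def expansion_steps(losses, threshold=0.5, patience=5, ema_decay=0.9):
    """Return the step indices at which a sustained loss deviation triggers expansion.

    A step counts as deviating when the loss exceeds the running baseline by more
    than `threshold`; `patience` consecutive deviating steps trigger an expansion.
    """
    triggers = []
    baseline = None
    streak = 0
    for step, loss in enumerate(losses):
        if baseline is None:
            baseline = loss  # initialise the baseline from the first observation
            continue
        if loss - baseline > threshold:
            streak += 1
            if streak >= patience:
                triggers.append(step)
                baseline = loss  # re-anchor the baseline after expanding
                streak = 0
        else:
            streak = 0
            baseline = ema_decay * baseline + (1 - ema_decay) * loss
    return triggers

if __name__ == "__main__":
    shifted = [1.0] * 20 + [3.0] * 20             # loss jumps at a domain boundary
    spike = [1.0] * 10 + [3.0] * 2 + [1.0] * 10   # short spike, not sustained
    print(expansion_steps(shifted))  # [24]
    print(expansion_steps(spike))    # []
```

Requiring several consecutive deviating steps is what separates a domain shift from a transient spike in this sketch; the baseline is frozen while a deviation is in progress so that it does not drift toward the new loss level.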

270. ❌ Position: Explainable AI is Causality in Disguise

Authors: Amir-Hossein Karimi Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

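Scoring rationale: This is a position paper on explainable AI (XAI). It argues for reframing XAI problems as causal inference problems and proves that causal models are necessary and sufficient for XAI. Its core topics are XAI and causal discovery, so it is unrelated to most keywords (LLMs, MoE, SFT, RLHF, RAG, quantization, and so on), which concern specific large-model techniques, training methods, optimization, or applications. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI", which scores 10/10 because XAI is the paper's central subject. All other keywords score 0.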

!!! tip deepseek-chat TL;DR

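This position paper argues that the long-standing discord in explainable AI (XAI) stems from a missing causal grounding. It proposes reframing XAI queries as causal inquiries and proves that causal models are necessary and sufficient for XAI.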

Abstract (Chinese translation)

对可解释人工智能（XAI）的需求引发了方法论的爆炸式增长，导致当前研究领域高度碎片化，以至于我们不得不依赖对综述的综述来把握全局。然而，根本性挑战依然存在：相互矛盾的评估指标、未能通过的有效性检验，以及围绕鲁棒性与公平性的未决争议。关于如何实现可解释性的唯一共识，就是缺乏共识。这使得许多人将矛头指向了定义“正确”解释所需基准真值的缺失，视其为问题的主要根源。
本立场论文提出，XAI领域持续存在的分歧并非源于基准真值的缺失，而是源于一个虽存在却难以企及的基准真值：即支配相关系统的因果模型。通过将关于数据、模型或决策的XAI查询重新定义为因果性探究，我们论证了因果模型对于XAI的必要性和充分性。我们认为，若缺乏这种因果基础，XAI将始终处于无根浮萍的状态。最终，我们鼓励研究社群围绕先进的概念发现与因果发现方法凝聚共识，以摆脱当前根深蒂固的不确定性。

Abstract

The demand for Explainable AI (XAI) has triggered an explosion of methods, producing a landscape so fragmented that we now rely on surveys of surveys. Yet, fundamental challenges persist: conflicting metrics, failed sanity checks, and unresolved debates over robustness and fairness. The only consensus on how to achieve explainability is a lack of one. This has led many to point to the absence of a ground truth for defining “the” correct explanation as the main culprit. This position paper posits that the persistent discord in XAI arises not from an absent ground truth but from a ground truth that exists, albeit as an elusive and challenging target: the causal model that governs the relevant system. By reframing XAI queries about data, models, or decisions as causal inquiries, we prove the necessity and sufficiency of causal models for XAI. We contend that without this causal grounding, XAI remains unmoored. Ultimately, we encourage the community to converge around advanced concept and causal discovery to escape this entrenched uncertainty.

Keywords: Explainable AI, XAI, Causality, Causal Models, Interpretability, Causal Discovery, Ground Truth, Position Paper

271. ❌ Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

Authors: Max Qiushi Lin, Reza Asad, Kevin Tan, Haque Ishfaq, Csaba Szepesvari, Sharan Vaswani Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28595v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

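Scoring rationale: This paper studies theoretical improvements to actor-critic methods in reinforcement learning for finite-horizon linear Markov decision processes, and proposes an optimistic actor-critic framework that uses parametric log-linear policies. Its content is focused entirely on the theoretical analysis of reinforcement learning algorithms and has no direct connection to any of the scoring keywords (which all concern large models, the technical principles of deep learning, or AI applications in science). The paper does not mention language models, model training techniques, reasoning methods, agent systems, or scientific AI applications.

!!! tip deepseek-chat TL;DR

For finite-horizon linear Markov decision processes, the paper proposes an optimistic actor-critic framework that uses parametric log-linear policies. It balances theory and practice and attains a state-of-the-art sample complexity that matches existing theoretical work.

Abstract (translated)

Although actor-critic methods have succeeded in practice, their theoretical analysis still has several limitations. Specifically, existing theoretical studies either avoid the exploration problem through strong assumptions or analyze impractical methods with complex algorithmic modifications. In addition, actor-critic methods analyzed for linear Markov decision processes (linear MDPs) usually adopt natural policy gradient (NPG) and construct “implicit” policies with no explicit parameterization. Such policies are computationally expensive to sample from, which makes interaction with the environment inefficient. We therefore focus on finite-horizon linear MDPs and propose an optimistic actor-critic framework with parametric log-linear policies. In particular, we introduce a tractable “logit-matching” regression objective for the actor. For the critic, we perform approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves sample complexities of $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ in the on-policy and off-policy settings, respectively. Our results match prior theoretical studies in attaining state-of-the-art sample complexity, while the proposed algorithm is closer to practical use.

Abstract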

Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct “implicit” policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable logit-matching regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy settings, respectively. Our results match prior theoretical works in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.
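To make the phrase "parametric log-linear policy" concrete, the sketch below shows the standard softmax form in which the probability of an action is proportional to exp(θ·φ(s, a)). The feature vectors, the parameter values, and the function names are hypothetical and chosen only for illustration; this is not the paper's algorithm, only the kind of explicit policy that can be sampled from cheaply.

```python
import math
import random

def log_linear_policy(theta, features):
    """Return action probabilities proportional to exp(theta . phi(s, a))."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(theta, features, rng):
    """Draw one action index from the explicit softmax distribution."""
    probs = log_linear_policy(theta, features)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical state with three actions and 2-dimensional features phi(s, a).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.5]  # illustrative policy parameters

probs = log_linear_policy(theta, features)
action = sample_action(theta, features, random.Random(0))
```

Because the distribution is given in closed form by θ, sampling costs one pass over the actions, in contrast to the "implicit" policies the abstract describes as expensive to sample from.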

Keywords: actor-critic methods, linear MDPs, parametric policies, log-linear policies, optimistic value estimates, sample complexity, theoretical analysis, finite-horizon

272. ❌ Physics-Informed Framework for Impact Identification in Aerospace Composites

Authors: Natália Ribeiro Marinho, Richard Loendersloot, Jan Willem Wiegman, Frank Grooteman, Tiedo Tinga Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28593v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

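Scoring rationale: This paper proposes a physics-informed impact identification framework (Phy-ID). It is an application of AI to science and engineering, but it focuses specifically on structural health monitoring of aerospace composites rather than on general large-model techniques. Its core is physically consistent modelling that combines physical knowledge with data-driven methods, which falls under physics-informed machine learning, but it does not explicitly use large language models (LLMs), architectural innovations in deep learning (such as MoE or attention mechanisms), training and alignment techniques (such as RLHF or SFT), inference optimization (such as quantization or acceleration), or agent systems. The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the study applies AI to science (specifically engineering mechanics), though not to bioinformatics or cheminformatics; it is therefore given a moderate relevance of 5. All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a physics-informed impact identification framework that integrates physical knowledge with data-driven methods to infer impact parameters (such as velocity, mass, and energy) in aerospace composites in a physically consistent and data-efficient way. Experimental errors are below 10%, and performance stays stable under limited data and noise.

Abstract (translated)

This paper presents a novel physics-informed impact identification (Phy-ID) framework. By integrating observational, inductive, and learning biases, the method fuses physical knowledge with data-driven inference in a unified modelling strategy and achieves physically consistent and numerically stable impact identification. The physics-informed approach builds the input space from physics-based energy indicators, constrains the space of admissible solutions through architectural design, and reinforces the governing relations with hybrid loss functions. Together, these mechanisms limit non-physical solutions and stabilise inference when measurement conditions degrade. A disjoint inference model serves as a representative use case to demonstrate the framework: impact velocity and impactor mass are inferred separately by decoupled surrogate models, and impact energy is computed through a kinetic energy consistency constraint. Experimental evaluation shows mean absolute percentage errors below 8% for the inferred impact velocity and impactor mass and below 10% for impact energy. Further analysis confirms that the method remains stable with limited data and increased measurement noise, and that it generalises to out-of-distribution cases in both pristine and damaged states when the training data include damaged responses. These results show that the systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, and they highlight the potential of the approach for practical monitoring systems.

Abstract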

This paper introduces a novel physics-informed impact identification (Phy-ID) framework. The proposed method integrates observational, inductive, and learning biases to combine physical knowledge with data-driven inference in a unified modelling strategy, achieving physically consistent and numerically stable impact identification. The physics-informed approach structures the input space using physics-based energy indicators, constrains admissible solutions via architectural design, and enforces governing relations via hybrid loss formulations. Together, these mechanisms limit non-physical solutions and stabilise inference under degraded measurement conditions. A disjoint inference formulation is used as a representative use case to demonstrate the framework capabilities, in which impact velocity and impactor mass are inferred through decoupled surrogate models, and impact energy is computed by enforcing kinetic energy consistency. Experimental evaluations show mean absolute percentage errors below 8% for inferred impact velocity and impactor mass and below 10% for impact energy. Additional analyses confirm stable performance under reduced data availability and increased measurement noise, as well as generalisation for out-of-distribution cases across pristine and damaged regimes when damaged responses are included in training. These results indicate that the systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, highlighting the potential of the approach for practical monitoring systems.
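The "kinetic energy consistency" step can be illustrated with the textbook relation E = ½·m·v²: once velocity and mass are inferred separately, energy follows from them rather than from a third model. The sketch below assumes that relation and uses made-up numbers; the helper names and the sample values are hypothetical and are not taken from the paper.

```python
def kinetic_energy(mass_kg, velocity_m_s):
    """Kinetic energy in joules from mass (kg) and velocity (m/s)."""
    return 0.5 * mass_kg * velocity_m_s ** 2

def mape(true_values, predicted_values):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - t) / abs(t) for t, p in zip(true_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical ground truth and surrogate-model outputs for two impacts.
true_mass = [1.0, 2.0]
true_velocity = [3.0, 4.0]
pred_mass = [1.05, 1.9]
pred_velocity = [2.9, 4.1]

# Energy is derived from the two inferred quantities, so it stays consistent with them.
true_energy = [kinetic_energy(m, v) for m, v in zip(true_mass, true_velocity)]
pred_energy = [kinetic_energy(m, v) for m, v in zip(pred_mass, pred_velocity)]

mass_mape = mape(true_mass, pred_mass)
velocity_mape = mape(true_velocity, pred_velocity)
energy_mape = mape(true_energy, pred_energy)
```

Since energy depends on the square of velocity, velocity errors are amplified in the derived energy, which is one plausible reason an energy error bound can be looser than the bounds on its inputs.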

Keywords: physics-informed, impact identification, aerospace composites, data-driven inference, hybrid loss, surrogate models, kinetic energy consistency, structural health monitoring

273. ❌ Universal Approximation Constraints of Narrow ResNets: The Tunnel Effect

Authors: Christian Kuehn, Sara-Viola Kuntz, Tobias Wöhrer Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28591v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

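Scoring rationale: This paper studies the universal approximation constraints of narrow residual neural networks (ResNets). It belongs to traditional deep learning theory and does not involve large language models (LLMs), innovations in the technical principles of large models, or AI applications in science. All keywords relate to large models, whereas this paper concentrates on the theoretical analysis of a classical neural network architecture, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyzes, theoretically and numerically, the universal approximation constraints of narrow residual neural networks (ResNets), proves that they cannot represent critical points of the input-output map, and quantifies the approximation error under different channel signal ratios.

Abstract (translated)

We analyze the universal approximation constraints of narrow residual neural networks (ResNets) from both theoretical and numerical perspectives. For deep neural networks without input-space augmentation, a central constraint is that they cannot represent critical points of the input-output map. We prove that this has global consequences for approximating target functions, and show that the typical manifestation of this defect is that the critical point is pushed to infinity, which we call the “tunnel effect” in the context of classification tasks. Although ResNets are more expressive than standard multilayer perceptrons (MLPs), their performance depends strongly on the signal ratio between the skip channel and the residual channel. We establish quantitative approximation bounds for both the residual-dominant regime (close to an MLP) and the skip-dominant regime (close to a neural ODE). These estimates depend explicitly on the channel ratio and on uniform bounds on the network weights. Using low-dimensional examples, we further analyze in detail the different ResNet regimes and how incompatibility between architecture and target function affects the approximation error.

Abstract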

We analyze the universal approximation constraints of narrow Residual Neural Networks (ResNets) both theoretically and numerically. For deep neural networks without input space augmentation, a central constraint is the inability to represent critical points of the input-output map. We prove that this has global consequences for target function approximations and show that the manifestation of this defect is typically a shift of the critical point to infinity, which we call the “tunnel effect” in the context of classification tasks. While ResNets offer greater expressivity than standard multilayer perceptrons (MLPs), their capability strongly depends on the signal ratio between the skip and residual channels. We establish quantitative approximation bounds for both the residual-dominant (close to MLP) and skip-dominant (close to neural ODE) regimes. These estimates depend explicitly on the channel ratio and uniform network weight bounds. Low-dimensional examples further provide a detailed analysis of the different ResNet regimes and how architecture-target incompatibility influences the approximation error.

Keywords: Residual Neural Networks, Universal Approximation, Tunnel Effect, Critical Points, Approximation Bounds, Channel Ratio, Neural ODE, MLP

274. ❌ Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation

Authors: Yoann Boget, Alexandros Kalousis Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28572v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

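Scoring rationale: The paper focuses on generative modelling of discrete data, graph generation in particular, and proposes a non-Markovian denoising method based on the probability simplex. Although it can be viewed as an application of deep learning in a scientific setting (graph generation), its content has no direct connection to the scoring keywords (which all revolve around the technical principles, training methods, inference optimization, alignment, and applications of large models). The paper does not touch on large models, language models, training techniques, reasoning methods, alignment, compression, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes a non-Markovian denoising framework on the probability simplex for generative modelling of discrete data, and it outperforms existing discrete diffusion and flow-matching baselines on graph generation tasks.

Abstract (translated)

Denoising models such as diffusion and flow matching have recently advanced generative modelling of discrete structures, but most methods operate directly in the discrete state space, which causes abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. Its core idea is a non-Markovian noising scheme: for a given clean data point, the noisy representations at different time steps are conditionally independent. While retaining the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, which improves performance and simplifies the formulation. Empirically, unrestrained simplex denoising surpasses several strong discrete diffusion and flow-matching baselines on synthetic and real-world graph benchmarks. These results highlight the potential of the probability simplex as an effective framework for discrete generative modelling.

Abstract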

Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, unrestrained simplex denoising surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.

Keywords: simplex denoising, discrete data, non-Markovian, graph generation, generative modeling, probability simplex, denoising models, diffusion models

275. ❌ Multimodal Analytics of Cybersecurity Crisis Preparation Exercises: What Predicts Success?

Authors: Conrad Borchers, Valdemar Švábenský, Sandesh K. Kafle, Kevin K. Tang, Jan Vykopal Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28553v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

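Scoring rationale: The paper studies the prediction of instructional alignment in cybersecurity crisis preparation exercises, using Bloom’s taxonomy to analyze team emails and log features. It sits at the intersection of educational technology and cybersecurity. It does not involve large models, the technical principles of deep learning, or AI for Science applications, and none of the keywords (large-model techniques, training methods, inference optimization, agent systems, scientific AI applications, and so on) apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how instructional alignment (the match between intended cognition and enacted activity) predicts team success in cybersecurity simulation exercises. It finds that multimodal data (text embeddings and log features) predict performance better than Bloom’s taxonomy alone, while alignment provides interpretable diagnostic insight.

Abstract (translated)

Instructional alignment, the match between intended cognition and enacted activity, is central to effective teaching but hard to operationalize at scale. This study examines alignment in cybersecurity simulation training using multimodal data traces from 23 teams (76 students) across five exercise sessions. Study 1 codes the learning objectives and team emails with Bloom’s taxonomy and models the completion of key exercise tasks with generalized linear mixed models. Alignment is defined as the discrepancy between the required and the enacted Bloom cognitive level; this discrepancy predicts task success, whereas the Bloom category alone does not once the discrepancy is taken into account. Study 2 compares families of predictive features using grouped cross-validation and L1-regularized logistic regression. Text-embedding features and log features outperform models based only on Bloom features (AUC of about 0.74 and 0.71, respectively, versus 0.55), and their combination performs best (test AUC of about 0.80), with Bloom frequencies adding very little. Overall, the study offers a measure of alignment for simulation training and shows that multimodal data traces best predict performance, while alignment analysis provides interpretable diagnostic insight.

Abstract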

Instructional alignment, the match between intended cognition and enacted activity, is central to effective instruction but hard to operationalize at scale. We examine alignment in cybersecurity simulations using multimodal traces from 23 teams (76 students) across five exercise sessions. Study 1 codes objectives and team emails with Bloom’s taxonomy and models the completion of key exercise tasks with generalized linear mixed models. Alignment, defined as the discrepancy between required and enacted Bloom levels, predicts success, whereas the Bloom category alone does not predict success once discrepancy is considered. Study 2 compares predictive feature families using grouped cross-validation and L1-regularized logistic regression. Text embeddings and log features outperform Bloom-only models (AUC ≈ 0.74 and 0.71 vs. 0.55), and their combination performs best (Test AUC ≈ 0.80), with Bloom frequencies adding little. Overall, the work offers a measure of alignment for simulations and shows that multimodal traces best forecast performance, while alignment provides interpretable diagnostic insight.
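Grouped cross-validation matters here because several observations come from the same team; if a team appeared in both the training and the test fold, the reported AUC could be inflated. The sketch below is a minimal pure-Python illustration of the splitting idea only. The round-robin assignment, the team identifiers, and the function names are assumptions made for illustration and are not the study's actual pipeline.

```python
from collections import defaultdict

def grouped_folds(groups, n_folds):
    """Assign sample indices to folds so that each group lands in exactly one fold."""
    unique_groups = sorted(set(groups))
    # Round-robin assignment of whole groups to folds (an illustrative choice).
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    folds = defaultdict(list)
    for index, g in enumerate(groups):
        folds[fold_of_group[g]].append(index)
    return [folds[k] for k in range(n_folds)]

def train_test_splits(groups, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    all_indices = set(range(len(groups)))
    for test_indices in grouped_folds(groups, n_folds):
        yield sorted(all_indices - set(test_indices)), test_indices

# Hypothetical team label for each observation (e.g. one task attempt per row).
team_ids = ["A", "A", "B", "C", "B", "C", "D"]
splits = list(train_test_splits(team_ids, n_folds=2))
```

Each resulting pair keeps all rows of a team on one side of the split, so a classifier is always evaluated on teams it has never seen.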

Keywords: cybersecurity simulations, instructional alignment, Bloom’s taxonomy, multimodal analytics, predictive modeling, team performance, educational technology, logistic regression

276. ❌ With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems

Authors: Giovanni De Toni, Cristian Consonni, Erasmo Purificato, Emilia Gomez, Bruno Lepri Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

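Scoring rationale: The paper studies collective manipulation in recommender systems and strategies to mitigate it, which belongs to traditional machine learning and recommender systems rather than large models or deep learning techniques. Its content covers risk-controlling recommender systems, coordinated user behaviour, adversarial attacks, and mitigation strategies, and is entirely unrelated to the scoring keywords (which all focus on large-model techniques, architectures, training methods, inference optimization, alignment, and applications).

!!! tip deepseek-chat TL;DR

The paper studies a vulnerability of risk-controlling recommender systems in which a small group of users coordinates to manipulate feedback signals (such as “Not Interested”) and lowers overall recommendation quality (nDCG drops by up to 20%). It proposes a mitigation strategy that shifts the safety guarantee from the group level to the user level.

Abstract (translated)

Recommender systems have become key gatekeepers of online information and broadly shape user behaviour. In response, users increasingly organize and coordinate through platform affordances (such as likes, reviews, or ratings) to steer algorithmic outcomes toward diverse goals, for example promoting relevant content or limiting harmful material. Although these mechanisms can serve beneficial purposes, they can also be used for adversarial manipulation, especially in systems where user feedback bears directly on safety guarantees. This paper studies that vulnerability in recently proposed risk-controlling recommender systems, which use binary user feedback (for example “Not Interested”) to provably limit exposure to unwanted content via conformal risk control. We show empirically that their reliance on aggregate feedback signals makes them inherently susceptible to coordinated adversarial user behaviour. Using data from a large-scale online video-sharing platform, we show that a small coordinated group comprising only 1% of users can cause nDCG for non-adversarial users to drop by up to 20% by exploiting the affordances that risk-controlling recommender systems provide. We evaluate simple, realistic attack strategies that require almost no knowledge of the underlying recommendation algorithm, and find that although coordinated users can significantly harm overall recommendation quality, they cannot selectively suppress specific content groups through reporting alone. Finally, we propose a mitigation strategy that shifts the guarantee from the group level to the user level, and show empirically how it reduces the impact of adversarial coordination while ensuring personalized safety for individuals.

Abstract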

Recommendation systems have become central gatekeepers of online information, shaping user behaviour across a wide range of activities. In response, users increasingly organize and coordinate to steer algorithmic outcomes toward diverse goals, such as promoting relevant content or limiting harmful material, relying on platform affordances – such as likes, reviews, or ratings. While these mechanisms can serve beneficial purposes, they can also be leveraged for adversarial manipulation, particularly in systems where such feedback directly informs safety guarantees. In this paper, we study this vulnerability in recently proposed risk-controlling recommender systems, which use binary user feedback (e.g., “Not Interested”) to provably limit exposure to unwanted content via conformal risk control. We empirically demonstrate that their reliance on aggregate feedback signals makes them inherently susceptible to coordinated adversarial user behaviour. Using data from a large-scale online video-sharing platform, we show that a small coordinated group (comprising only 1% of the user population) can induce up to a 20% degradation in nDCG for non-adversarial users by exploiting the affordances provided by risk-controlling recommender systems. We evaluate simple, realistic attack strategies that require little to no knowledge of the underlying recommendation algorithm and find that, while coordinated users can significantly harm overall recommendation quality, they cannot selectively suppress specific content groups through reporting alone. Finally, we propose a mitigation strategy that shifts guarantees from the group level to the user level, showing empirically how it can reduce the impact of adversarial coordinated behaviour while ensuring personalized safety for individuals.
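For readers unfamiliar with the metric behind the "20% degradation" figure, the sketch below computes nDCG with the common linear-gain, log2-discount definition and then a relative drop between two rankings. The relevance lists are invented for illustration, and the paper may use a different gain function or cutoff; treat this as a sketch of the metric, not a reproduction of the experiment.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance of the items shown to one non-adversarial user,
# listed in the order the recommender displayed them.
clean_ranking = [3, 2, 1, 0]     # ranking without manipulation
attacked_ranking = [1, 0, 3, 2]  # ranking after manipulated feedback

# Relative nDCG degradation caused by the manipulated ranking.
relative_drop = 1.0 - ndcg(attacked_ranking) / ndcg(clean_ranking)
```

A value of `relative_drop` near 0.2 would correspond to the magnitude of degradation quoted in the abstract; the toy lists here are not calibrated to reproduce that number.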

Keywords: recommender systems, risk-controlling recommender systems, collective manipulation, adversarial user behavior, conformal risk control, nDCG degradation, mitigation strategy, personalized safety

277. ❌ Yau’s Affine Normal Descent: Algorithmic Framework and Convergence Analysis

Authors: Yi-Shuai Niu, Artan Sheshmani, Shing-Tung Yau Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

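Scoring rationale: The paper studies a mathematical optimization algorithm (Yau’s Affine Normal Descent) and belongs to pure and computational mathematics, covering a geometric framework, convergence analysis, and related topics. All scoring keywords concern large models, deep learning, AI applications, or the associated technical principles, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes YAND, an optimization algorithm based on affine differential geometry for smooth unconstrained optimization, and proves its convergence under several sets of conditions.

Abstract (translated)

We propose Yau’s Affine Normal Descent (YAND), a geometric framework for smooth unconstrained optimization in which search directions are defined by the equi-affine normal of level-set hypersurfaces. The resulting directions are invariant under volume-preserving affine transformations and adapt intrinsically to anisotropic curvature. Using the analytic representation of the affine normal from affine differential geometry, we prove its equivalence with the classical slice-centroid construction under convexity. For strictly convex quadratic objectives, the affine-normal direction is collinear with the Newton direction, which implies one-step convergence under exact line search. For general smooth (possibly nonconvex) objectives, we characterize precisely when the affine-normal direction yields strict descent and develop a line-search-based YAND algorithm. We establish global convergence under standard smoothness assumptions, linear convergence under strong convexity and Polyak-Lojasiewicz conditions, and quadratic local convergence near nondegenerate minimizers. We further show that affine-normal directions are robust under affine scaling and remain insensitive to arbitrarily ill-conditioned transformations. Numerical experiments illustrate the geometric behaviour of the method and its robustness under strong anisotropic scaling.

Abstract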

We propose Yau’s Affine Normal Descent (YAND), a geometric framework for smooth unconstrained optimization in which search directions are defined by the equi-affine normal of level-set hypersurfaces. The resulting directions are invariant under volume-preserving affine transformations and intrinsically adapt to anisotropic curvature. Using the analytic representation of the affine normal from affine differential geometry, we establish its equivalence with the classical slice-centroid construction under convexity. For strictly convex quadratic objectives, affine-normal directions are collinear with Newton directions, implying one-step convergence under exact line search. For general smooth (possibly nonconvex) objectives, we characterize precisely when affine-normal directions yield strict descent and develop a line-search-based YAND. We establish global convergence under standard smoothness assumptions, linear convergence under strong convexity and Polyak-Lojasiewicz conditions, and quadratic local convergence near nondegenerate minimizers. We further show that affine-normal directions are robust under affine scalings, remaining insensitive to arbitrarily ill-conditioned transformations. Numerical experiments illustrate the geometric behavior of the method and its robustness under strong anisotropic scaling.
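The one-step convergence claim for strictly convex quadratics can be checked numerically without computing an affine normal at all: any direction collinear with the Newton direction, combined with exact line search, lands on the minimizer. The sketch below verifies this for f(x) = ½xᵀAx − bᵀx on a 2×2 example. The matrix, the starting point, and the scaling factor 7 are arbitrary illustrative choices, and the code does not implement YAND itself.

```python
def mat_vec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(A, rhs):
    """Solve A z = rhs for a 2x2 matrix via the explicit inverse."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def exact_line_search_step(A, b, x, direction):
    """Minimise f(x + t * direction) over t for f(x) = 0.5 x^T A x - b^T x."""
    gradient = [g - bi for g, bi in zip(mat_vec(A, x), b)]
    t = -dot(gradient, direction) / dot(direction, mat_vec(A, direction))
    return [xi + t * di for xi, di in zip(x, direction)]

# Ill-conditioned symmetric positive definite example (illustrative values).
A = [[100.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]
x0 = [5.0, -3.0]

gradient0 = [g - bi for g, bi in zip(mat_vec(A, x0), b)]
newton_direction = [-s for s in solve_2x2(A, gradient0)]
# Any positive multiple is collinear with the Newton direction.
scaled_direction = [7.0 * d for d in newton_direction]

x1 = exact_line_search_step(A, b, x0, scaled_direction)
minimizer = solve_2x2(A, b)  # the unique minimizer solves A x = b
```

Here `x1` agrees with `minimizer` up to rounding, because the exact step length undoes the arbitrary scaling of the direction.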

Keywords: Yau’s Affine Normal Descent, geometric framework, smooth unconstrained optimization, affine differential geometry, convergence analysis, Newton directions, anisotropic curvature, affine scaling robustness

278. ❌ Mixture-Model Preference Learning for Many-Objective Bayesian Optimization

Authors: Manisha Dubey, Sebastiaan De Peuter, Wanrong Wang, Samuel Kaski Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

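Scoring rationale: The paper studies preference learning within the Bayesian optimization framework. Targeting many-objective optimization, it proposes a method for learning preference archetypes based on a Dirichlet-process mixture model. Although the paper involves machine learning, optimization algorithms, and preference modelling, all of the given keywords are aimed specifically at large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas this paper does not involve language models, deep learning architectures, or large-model techniques. Its core is Bayesian optimization and preference learning, which belongs to traditional machine-learning optimization and has no direct connection to the large-model topics in the keyword list.

!!! tip deepseek-chat TL;DR

To address the complex and heterogeneous preference space in many-objective optimization, the paper proposes a Bayesian framework based on a Dirichlet-process mixture model that learns latent preference archetypes. Its mixture-aware optimization method outperforms standard baselines on synthetic and real-world benchmarks.

Abstract (translated)

Preference-based many-objective optimization faces two obstacles: an ever-expanding space of trade-offs and heterogeneous, context-dependent human value structures. To address this, we propose a Bayesian framework that learns a small set of latent preference archetypes instead of assuming a single fixed utility function, and models the uncertainty over these archetypes and their weights with a Dirichlet-process mixture model. To query efficiently, we design hybrid queries that aim to obtain information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the proposed mixture-aware Bayesian optimization procedure. Experiments show that our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and that mixture-aware diagnostics reveal structure that regret alone cannot capture.

Abstract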

Preference-based many-objective optimization faces two obstacles: an expanding space of trade-offs and heterogeneous, context-dependent human value structures. Towards this, we propose a Bayesian framework that learns a small set of latent preference archetypes rather than assuming a single fixed utility function, modelling them as components of a Dirichlet-process mixture with uncertainty over both archetypes and their weights. To query efficiently, we design hybrid queries that target information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the resulting mixture-aware Bayesian optimization procedure. Empirically, our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and mixture-aware diagnostics reveal structure that regret alone fails to capture.
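One standard way to picture "a Dirichlet-process mixture with uncertainty over archetypes and their weights" is the truncated stick-breaking construction, which draws mixture weights that sum to one. The sketch below pairs it with linear scalarisation as the utility of each archetype. Both the truncation level and the linear utility form are assumptions made purely for illustration; the abstract does not specify how the authors parameterise archetypes or perform inference.

```python
import random

def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking: w_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component absorbs the leftover stick
    return weights

def mixture_utility(objectives, archetypes, weights):
    """Expected utility of one candidate under a mixture of linear archetypes (assumed form)."""
    return sum(
        w * sum(a * o for a, o in zip(archetype, objectives))
        for w, archetype in zip(weights, archetypes)
    )

rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, n_components=4, rng=rng)

# Hypothetical archetypes: each row is a trade-off vector over three objectives.
archetypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3]]
candidate = [0.2, 0.9, 0.5]  # objective values of one candidate solution
expected_utility = mixture_utility(candidate, archetypes, weights)
```

Smaller values of `alpha` concentrate the weight on a few archetypes, which matches the intuition of a small set of latent preference types.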

Keywords: Bayesian optimization, preference learning, many-objective optimization, Dirichlet-process mixture, latent preference archetypes, hybrid queries, simple regret guarantee, mixture-aware diagnostics

279. ❌ Label-efficient Training Updates for Malware Detection over Time

Authors: Luca Minnei, Cristian Manca, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Daniele Ghiani, Davide Maiorca, Giorgio Giacinto, Battista Biggio Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28396v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

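Scoring rationale: The paper focuses on machine learning methods for malware detection, specifically active learning and semi-supervised learning techniques for handling distribution drift. It covers conventional machine learning topics (model updates, reducing labeling cost, feature drift analysis) but does not involve large language models (LLMs), innovations in deep learning principles, or any of the specific large-model techniques listed in the scoring keywords (such as MoE, Scaling Laws, RLHF, or RAG). The paper is an AI application in cybersecurity; it addresses neither large models applied to science nor innovations in large-model technology itself, so every keyword receives a relevance of 0.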

!!! tip deepseek-chat TL;DR

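The paper proposes a model-agnostic framework that combines active learning and semi-supervised learning to cut labeling costs by up to 90% in Android and Windows malware detection while keeping detection performance comparable to full-labeling retraining, and it introduces a feature-level drift analysis methodology.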

摘要翻译

基于机器学习（ML）的检测器对于应对恶意软件的扩散正变得至关重要。然而，常见的机器学习算法并非为应对现实环境中动态变化的特性而设计，其中合法软件与恶意软件均在不断演化。这种分布漂移会导致基于静态假设训练的模型性能随时间下降，除非对其进行持续更新。然而，定期重新训练这些模型成本高昂，因为对新获取的数据进行标注需要安全专家进行昂贵的人工分析。为降低标注成本并应对恶意软件检测中的分布漂移，先前的研究探索了主动学习（Active Learning, AL）与半监督学习（Semi-Supervised Learning, SSL）技术。然而，现有研究存在以下局限：（i）它们与特定检测器架构紧密耦合，且局限于特定的恶意软件领域，导致比较标准不统一；（ii）尽管恶意软件领域对时间变化极为敏感，但缺乏分析分布漂移的一致方法。在本研究中，我们通过提出一个与模型无关的框架来弥补这一差距，该框架针对安卓（Android）和视窗（Windows）恶意软件检测，评估了大量单独及组合使用的主动学习与半监督学习技术。我们证明，这些技术结合使用时，能在两个领域中减少高达90%的人工标注成本，同时达到与全标注重新训练相当的检测性能。我们还引入了一种特征级漂移分析方法，用于衡量特征随时间变化的稳定性，并展示了其与检测器性能的相关性。总体而言，本研究深入揭示了主动学习与半监督学习在分布漂移下的表现及其成功结合的方式，为设计长期有效的检测器提供了实用见解。

摘要 (Abstract)

Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, isolated and combined, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving detection performance comparable to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with the detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.

关键词: malware detection, active learning, semi-supervised learning, distribution drift, label-efficient training, Android malware, Windows malware, feature-level drift analysis
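
The abstract above combines active learning (send a few samples to an analyst) with semi-supervised learning (let the detector label the rest). A minimal sketch of one such update step follows, assuming uncertainty sampling for the queries and confidence-thresholded pseudo-labeling for the remainder. These two strategies, the function name `split_batch`, and the 0.95 threshold are illustrative assumptions; the paper evaluates a broader set of techniques that this report does not enumerate.

```python
from typing import List, Tuple


def split_batch(probs: List[float], budget: int, confident: float = 0.95
                ) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Split an unlabeled batch into analyst queries and pseudo-labels.

    probs[i] is the detector's predicted probability that sample i is malware.
    Returns (indices to label manually, list of (index, pseudo_label)).
    """
    # Active learning: query the samples closest to the 0.5 decision boundary.
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    to_label = by_uncertainty[:budget]
    queried = set(to_label)

    # Semi-supervised learning: pseudo-label only very confident predictions.
    pseudo = []
    for i, p in enumerate(probs):
        if i in queried:
            continue
        if p >= confident:
            pseudo.append((i, 1))  # confidently malicious
        elif p <= 1.0 - confident:
            pseudo.append((i, 0))  # confidently benign
    return to_label, pseudo


probs = [0.02, 0.48, 0.99, 0.60, 0.97, 0.10]
to_label, pseudo = split_batch(probs, budget=2)
print(to_label)  # samples sent to the analyst
print(pseudo)    # samples labeled by the detector itself
```

Samples that are neither queried nor confident (index 5 here) are simply left out of the retraining set in this sketch.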

280. ❌ Machine Learning-Assisted High-Dimensional Matrix Estimation

作者: Wan Tian, Hui Yang, Zhouhui Lian, Lingyue Zhang, Yijie Peng 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28346v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

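Scoring rationale: The paper studies machine learning-assisted optimization for high-dimensional matrix estimation, using LADMM and neural networks to improve estimation accuracy and convergence speed. It belongs to classical machine learning optimization and is unrelated to all keywords (which concern large models, deep learning principles, or specific AI applications); nothing matches.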

!!! tip deepseek-chat TL;DR

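The paper proposes a machine learning-assisted method, based on the linearized alternating direction method of multipliers and neural-network reparameterization, for estimating high-dimensional covariance and precision matrices; it improves estimation accuracy and accelerates convergence.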

摘要翻译

高维矩阵——包括协方差矩阵与精度矩阵——的高效估计是现代多元统计学的基石。现有研究大多聚焦于估计量的理论性质（如一致性与稀疏性），而很大程度上忽视了高维背景下固有的计算挑战。受近年来基于学习的优化方法进展的启发——该方法将数据驱动结构与经典优化算法相结合——我们探索在机器学习辅助下的高维矩阵估计。具体而言，针对高维矩阵估计的优化问题，我们首先提出了一种基于线性化交替方向乘子法（Linearized Alternating Direction Method of Multipliers, LADMM）的求解流程。随后，我们引入可学习参数，并使用神经网络对迭代方案中的邻近算子进行建模，从而提升估计精度并加速收敛。在理论上，我们首先证明了LADMM的收敛性，进而为其重参数化版本建立了收敛性、收敛速率及单调性；重要的是，我们证明了重参数化的LADMM具有更快的收敛速率。值得注意的是，所提出的重参数化理论与方法同时适用于高维协方差矩阵与精度矩阵的估计。通过在不同结构与维度的高维矩阵上，将所提方法与多种经典优化算法进行比较，我们验证了该方法的有效性。

摘要 (Abstract)

Efficient estimation of high-dimensional matrices, including covariance and precision matrices, is a cornerstone of modern multivariate statistics. Most existing studies have focused primarily on the theoretical properties of the estimators (e.g., consistency and sparsity), while largely overlooking the computational challenges inherent in high-dimensional settings. Motivated by recent advances in learning-based optimization methods, which integrate data-driven structures with classical optimization algorithms, we explore high-dimensional matrix estimation assisted by machine learning. Specifically, for the optimization problem of high-dimensional matrix estimation, we first present a solution procedure based on the Linearized Alternating Direction Method of Multipliers (LADMM). We then introduce learnable parameters and model the proximal operators in the iterative scheme with neural networks, thereby improving estimation accuracy and accelerating convergence. Theoretically, we first prove the convergence of LADMM, and then establish the convergence, convergence rate, and monotonicity of its reparameterized counterpart; importantly, we show that the reparameterized LADMM enjoys a faster convergence rate. Notably, the proposed reparameterization theory and methodology are applicable to the estimation of both high-dimensional covariance and precision matrices. We validate the effectiveness of our method by comparing it with several classical optimization algorithms across different structures and dimensions of high-dimensional matrices.

关键词: high-dimensional matrix estimation, covariance matrices, precision matrices, LADMM, neural networks, reparameterization, convergence rate, optimization algorithms
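
The abstract above says the proximal operators inside the LADMM iteration are replaced by neural networks. For orientation, the sketch below shows the kind of classical proximal operator such a network would stand in for: soft-thresholding, the proximal operator of an L1 penalty, applied to the off-diagonal entries of a matrix. The choice of an L1 penalty, the decision to leave the diagonal untouched, and all names are assumptions for illustration; the report does not state which penalty the paper uses.

```python
from typing import List

Matrix = List[List[float]]


def soft_threshold(x: float, tau: float) -> float:
    """Proximal operator of tau * |x|: shrink x toward zero by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def prox_l1_offdiag(m: Matrix, tau: float) -> Matrix:
    """Soft-threshold the off-diagonal entries and keep the diagonal as is."""
    return [
        [v if i == j else soft_threshold(v, tau) for j, v in enumerate(row)]
        for i, row in enumerate(m)
    ]


sample_cov = [[1.0, 0.3, -0.05],
              [0.3, 2.0, 0.6],
              [-0.05, 0.6, 1.5]]
sparse = prox_l1_offdiag(sample_cov, tau=0.1)
for row in sparse:
    print([round(v, 3) for v in row])
```

In a learned variant, `soft_threshold` would be swapped for a small trainable function while the surrounding iteration stays the same.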

281. ❌ Key-Embedded Privacy for Decentralized AI in Biomedical Omics

作者: Rongyu Zhang, Hongyu Dong, Gaole Dai, Ziqi Qiao, Shenli Zheng, Yuan Zhang, Aosong Cheng, Xiaowei Chi, Jincai Luo, Pin Li, Li Du, Dan Wang, Yuan Du, Xudong Xing, Jianxu Chen, Shanghang Zhang 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28334v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

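Scoring rationale: The paper focuses on a privacy-preserving federated learning method (INFL) for biomedical omics and is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper explicitly applies AI to biomedical omics (proteomics, single-cell transcriptomics, spatial transcriptomics), which falls under AI for Science, so this keyword is rated 10 (highly relevant).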

!!! tip deepseek-chat TL;DR

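The paper proposes INFL, a lightweight federated learning method based on implicit neural representations that embeds a secret key into the model architecture to address the trade-off between data privacy and model performance in biomedical omics, achieving strong privacy protection while maintaining practical performance across several omics tasks.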

摘要翻译

生物医学领域数据驱动方法的快速普及加剧了人们对隐私、治理和监管的担忧，这限制了原始数据共享，并阻碍了为临床相关人工智能组建具有代表性的队列。这一现状亟需实用高效的隐私解决方案，因为密码学防御通常带来沉重开销，而差分隐私可能降低性能，导致实际应用场景中的效果欠佳。本文提出一种基于隐式神经表示（Implicit Neural Representations）的轻量级联邦学习方法INFL，以应对这些挑战。我们的方法将即插即用、坐标条件化模块集成到客户端模型中，将密钥直接嵌入架构，并支持跨异构站点的无缝聚合。在多种生物医学组学任务中——包括批量蛋白质组学的队列规模分类、单细胞转录组学的扰动预测回归、空间转录组学及公私数据混合的多组学聚类——我们证明INFL在保持下游科学和临床应用所需性能的同时，实现了强大且可控的隐私保护。

摘要 (Abstract)

The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.

关键词: federated learning, privacy preservation, biomedical omics, implicit neural representations, data-driven methods, heterogeneous sites, clinical applications, secret key embedding

282. ❌ Physics-Informed Neural Networks for Predicting Hydrogen Sorption in Geological Formations: Thermodynamically Constrained Deep Learning Integrating Classical Adsorption Theory

作者: Mohammad Nooraiepour, Mohammad Masoudi, Zezhang Song, Helge Hellevang 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28328v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

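Scoring rationale: The paper applies physics-informed neural networks (PINNs) to predict hydrogen sorption in geological reservoirs, an application of deep learning to science (specifically geoscience and energy). It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), since those keywords refer specifically to large language models (LLMs) and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is a direct application of AI for Science in geology and energy, so it is rated 10 (highly relevant, core content).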

!!! tip deepseek-chat TL;DR

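To address the poor generalization of traditional isotherm models when predicting hydrogen sorption in heterogeneous formations, the study proposes a physics-informed neural network framework that embeds thermodynamic constraints and uses a curriculum learning strategy, markedly improving prediction accuracy and cross-lithology generalization and reaching R² = 0.9544 on the test set.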

摘要翻译

准确预测细粒地质材料中的氢吸附行为，对于评估地下储氢容量、评价盖层完整性以及表征地下能源系统中氢气运移至关重要。经典等温线模型在单一样本尺度上表现良好，但在跨非均质样本群体泛化时失效，其决定系数从单样本拟合的0.80-0.90骤降至聚合多样本数据集的0.09-0.38。本文提出一种多尺度物理信息神经网络框架，通过将经典吸附理论和热力学约束直接嵌入学习过程以解决此局限性。该框架整合了黏土、页岩、煤等介质的1,987组氢吸附等温线测量数据，并辅以224组特征吸附量测量值。通过七类物理信息特征工程方案，从原始材料表征数据中生成62个具有热力学意义的描述符。损失函数通过惩罚加权机制强制满足饱和极限、单调压力响应及范特霍夫温度依赖性，同时采用三阶段课程式训练策略确保竞争性物理约束的稳定整合。由十个异构成员组成的架构多样化集成模型提供了校准的不确定性量化，后验温度缩放实现了目标预测区间覆盖度。优化后的物理信息神经网络在保留测试集上达到R² = 0.9544、RMSE = 0.0484 mmol/g、MAE = 0.0231 mmol/g的性能指标，单调性满足率达98.6%且无非物理性负预测。在留一岩性交叉验证中，物理信息正则化相比优化随机森林模型获得10-15%的跨岩性泛化优势，证实热力学约束能够有效跨越地质边界传递。

摘要 (Abstract)

Accurate prediction of hydrogen sorption in fine-grained geological materials is essential for evaluating underground hydrogen storage capacity, assessing caprock integrity, and characterizing hydrogen migration in subsurface energy systems. Classical isotherm models perform well at the individual-sample level but fail when generalized across heterogeneous populations, with the coefficient of determination collapsing from 0.80-0.90 for single-sample fits to 0.09-0.38 for aggregated multi-sample datasets. We present a multi-scale physics-informed neural network framework that addresses this limitation by embedding classical adsorption theory and thermodynamic constraints directly into the learning process. The framework utilizes 1,987 hydrogen sorption isotherm measurements across clays, shales, and coals, supplemented by 224 characteristic uptake measurements. A seven-category physics-informed feature engineering scheme generates 62 thermodynamically meaningful descriptors from raw material characterization data. The loss function enforces saturation limits, a monotonic pressure response, and van 't Hoff temperature dependence via penalty weighting, while a three-phase curriculum-based training strategy ensures stable integration of competing physical constraints. An architecture-diverse ensemble of ten members provides calibrated uncertainty quantification, with post-hoc temperature scaling achieving target prediction interval coverage. The optimized PINN achieves R² = 0.9544, RMSE = 0.0484 mmol/g, and MAE = 0.0231 mmol/g on the held-out test set, with 98.6% monotonicity satisfaction and zero non-physical negative predictions. Physics-informed regularization yields a 10-15% cross-lithology generalization advantage over a well-tuned random forest under leave-one-lithology-out validation, confirming that thermodynamic constraints transfer meaningfully across geological boundaries.

关键词: Physics-Informed Neural Networks, Hydrogen Sorption, Geological Formations, Thermodynamic Constraints, Deep Learning, Adsorption Theory, Multi-scale Modeling, Uncertainty Quantification
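
The abstract above describes a loss that penalizes violations of saturation limits, monotonic pressure response, and van 't Hoff temperature dependence. The sketch below illustrates those three ingredients under explicit assumptions: a Langmuir isotherm stands in for "classical adsorption theory", the penalties are simple squared hinge terms, and all parameter values and units are made up for the example. None of this is taken from the paper's implementation.

```python
import math
from typing import Callable, List

R_GAS = 8.314  # universal gas constant, J/(mol*K)


def vant_hoff_k(k0: float, delta_h: float, temp_k: float) -> float:
    """van 't Hoff form K(T) = k0 * exp(-delta_h / (R * T))."""
    return k0 * math.exp(-delta_h / (R_GAS * temp_k))


def langmuir_uptake(pressure: float, temp_k: float, q_max: float,
                    k0: float, delta_h: float) -> float:
    """Langmuir isotherm q = q_max * K * P / (1 + K * P)."""
    k = vant_hoff_k(k0, delta_h, temp_k)
    return q_max * k * pressure / (1.0 + k * pressure)


def physics_penalty(predict: Callable[[float], float],
                    pressures: List[float], q_max: float) -> float:
    """Squared-hinge penalty; pressures must be sorted in ascending order."""
    preds = [predict(p) for p in pressures]
    # Saturation limits: uptake must stay within [0, q_max].
    bound = sum(max(0.0, -q) ** 2 + max(0.0, q - q_max) ** 2 for q in preds)
    # Monotonic pressure response: uptake must not drop as pressure rises.
    mono = sum(max(0.0, a - b) ** 2 for a, b in zip(preds, preds[1:]))
    return bound + mono


# Hypothetical parameters: exothermic adsorption (delta_h < 0), pressure in MPa.
params = dict(q_max=0.5, k0=1e-3, delta_h=-8000.0)
pressures = [0.1, 1.0, 5.0, 10.0]

good = physics_penalty(lambda p: langmuir_uptake(p, 300.0, **params),
                       pressures, params["q_max"])
bad = physics_penalty(lambda p: 0.3 - 0.01 * p, pressures, params["q_max"])
print(good, bad)
```

A physically consistent predictor incurs zero penalty, while the decreasing toy predictor is penalized; in a training loop this term would be added to the data-fit loss with a weight.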

283. ❌ LDDMM stochastic interpolants: an application to domain uncertainty quantification in hemodynamics

作者: Sarah Katz, Francesco Romor, Jia-Jie Zhu, Alfonso Caiazzo 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28324v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

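Scoring rationale: The paper proposes an LDDMM-based conditional stochastic interpolant framework for generative modeling of three-dimensional shapes, applied to aortic shape generation and uncertainty quantification in cardiovascular simulations. Its core is geometric shape generation and medical image analysis, an application of AI in biomedicine. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' bears some relation to the paper's topic, because the paper involves biomedical shape generation and cardiovascular simulation, an application of AI in science (particularly in areas related to bioinformatics). The other keywords concern large language models, model training, inference optimization, agent systems, and so on, and are unrelated, so they are rated 0. The weighted total is only 5.0, far below the dynamic passing score of 26.6, indicating that the paper is largely unrelated to the large-model and deep-learning innovations the review targets.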
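
The comparison in the rationale above can be reproduced from the formula in the report's configuration section (passing score = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). In the sketch below, the per-keyword rule "score = weight × relevance" is an assumption made for illustration; the report does not spell out how relevance is converted into a keyword score, and the function names are hypothetical.

```python
from typing import List

BASE_PASS = 5.0    # constant term of the passing-score formula
PER_WEIGHT = 0.8   # multiplier applied to the total keyword weight


def passing_score(weights: List[float]) -> float:
    """Passing score = 5.0 + 0.8 * total weight (from the report config)."""
    return BASE_PASS + PER_WEIGHT * sum(weights)


def weighted_total(weights: List[float], relevance: List[float]) -> float:
    """ASSUMPTION: each keyword contributes weight * relevance."""
    return sum(w * r for w, r in zip(weights, relevance))


weights = [1.0] * 27               # 27 keywords, weight 1.0 each
relevance = [0.0] * 26 + [5.0]     # only the last keyword is rated 5.0/10

threshold = passing_score(weights)
total = weighted_total(weights, relevance)
passed = total >= threshold
print(round(threshold, 1), total, passed)
```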

!!! tip deepseek-chat TL;DR

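The paper proposes an LDDMM-based conditional stochastic interpolant framework for generating three-dimensional biomedical shapes (such as aortas) and applies it to uncertainty quantification in cardiovascular simulations, in order to assess how medical image segmentation affects biomarker estimation.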

摘要翻译

我们提出了一种新颖的条件随机插值框架，用于三维形状的生成建模。该方法基于近期一种基于LDDMM的配准方法，以学习几何结构间的条件漂移。通过利用由此产生的拉回与推前算子，我们将该框架从标准笛卡尔网格扩展到复杂形状及定义在不同域上的随机变量。我们在心血管模拟的背景下展示了应用实例，即从初始患者队列生成主动脉形状。条件变量是一种由一组中心线点及相应内切球半径定义的潜在几何表征。该方法不仅促进了三维生物医学形状的数据增强，还能为给定形状生成受控幅度的随机扰动。这些功能对于量化医学图像分割引起的域不确定性对相关生物标志物估计的影响至关重要。

摘要 (Abstract)

We introduce a novel conditional stochastic interpolant framework for generative modeling of three-dimensional shapes. The method builds on a recent LDDMM-based registration approach to learn the conditional drift between geometries. By leveraging the resulting pull-back and push-forward operators, we extend this formulation beyond standard Cartesian grids to complex shapes and random variables defined on distinct domains. We present an application in the context of cardiovascular simulations, where aortic shapes are generated from an initial cohort of patients. The conditioning variable is a latent geometric representation defined by a set of centerline points and the radii of the corresponding inscribed spheres. This methodology facilitates both data augmentation for three-dimensional biomedical shapes, and the generation of random perturbations of controlled magnitude for a given shape. These capabilities are essential for quantifying the impact of domain uncertainties arising from medical image segmentation on the estimation of relevant biomarkers.

关键词: stochastic interpolants, generative modeling, three-dimensional shapes, cardiovascular simulations, aortic shapes, domain uncertainty quantification, biomedical shapes, data augmentation

284. ❌ FairGC: Fairness-aware Graph Condensation

作者: Yihan Gao, Chenxi Huang, Wen Shi, Ke Sun, Ziqi Xu, Xikun Zhang, Mingliang Hou, Renqiang Luo 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28321v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

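Scoring rationale: The paper "FairGC: Fairness-aware Graph Condensation" focuses on graph condensation for graph neural networks (GNNs) and introduces fairness constraints. It covers graph data compression, fairness-aware algorithm design, and preservation of graph structure, and belongs to graph machine learning. All scoring keywords revolve around large language models (LLMs), deep learning principles (such as MoE, Scaling Laws, training methods, inference optimization, and agents), or AI applications in specific scientific fields (such as bioinformatics). The paper does not involve large language models or foundational deep learning techniques, nor does it apply large models to scientific fields such as bio- or cheminformatics, so it is unrelated to every keyword and all are rated 0.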

!!! tip deepseek-chat TL;DR

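Addressing the fact that existing graph condensation methods ignore fairness constraints, the paper proposes FairGC, a framework that embeds fairness components into the graph distillation process and significantly reduces disparities in statistical parity and equal opportunity while maintaining predictive accuracy.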

摘要翻译

图压缩（Graph Condensation，GC）已成为通过将海量数据集压缩为小型合成节点集来扩展图神经网络规模的关键策略。尽管当前的GC方法能有效保持预测准确性，但它们主要围绕实用性设计，往往忽略了公平性约束。由于这些技术对偏差不敏感，它们常常会捕捉甚至放大原始数据中存在的人口统计差异。这导致生成的合成代理数据不适用于信用评分或社交推荐等敏感应用场景。为解决这一问题，我们提出了FairGC，一个将公平性直接嵌入图蒸馏过程的统一框架。我们的方法包含三个核心组件。首先，分布保持压缩模块通过同步标签与敏感属性的联合分布来阻止偏差传播。其次，谱编码模块利用拉普拉斯特征分解来保持关键的全局结构模式。最后，公平性增强神经架构采用多域融合与标签平滑课程学习机制来生成公平的预测结果。在四个真实数据集上的严格评估表明，FairGC在准确性与公平性之间实现了更优的平衡。实验结果证实，与现有最先进的压缩模型相比，FairGC在统计均等与机会均等指标上显著降低了差异。代码已开源：https://github.com/LuoRenqiang/FairGC。

摘要 (Abstract)

Graph condensation (GC) has become a vital strategy for scaling Graph Neural Networks by compressing massive datasets into small, synthetic node sets. While current GC methods effectively maintain predictive accuracy, they are primarily designed for utility and often ignore fairness constraints. Because these techniques are bias-blind, they frequently capture and even amplify demographic disparities found in the original data. This leads to synthetic proxies that are unsuitable for sensitive applications like credit scoring or social recommendations. To solve this problem, we introduce FairGC, a unified framework that embeds fairness directly into the graph distillation process. Our approach consists of three key components. First, a Distribution-Preserving Condensation module synchronizes the joint distributions of labels and sensitive attributes to stop bias from spreading. Second, a Spectral Encoding module uses Laplacian eigen-decomposition to preserve essential global structural patterns. Finally, a Fairness-Enhanced Neural Architecture employs multi-domain fusion and a label-smoothing curriculum to produce equitable predictions. Rigorous evaluations on four real-world datasets show that FairGC provides a superior balance between accuracy and fairness. Our results confirm that FairGC significantly reduces disparity in Statistical Parity and Equal Opportunity compared to existing state-of-the-art condensation models. The code is available at https://github.com/LuoRenqiang/FairGC.

关键词: Graph Condensation, Fairness-aware, Graph Neural Networks, Synthetic Datasets, Demographic Disparities, Distribution-Preserving, Spectral Encoding, Multi-domain Fusion
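
The abstract above reports reduced disparity in Statistical Parity and Equal Opportunity. A minimal sketch of how these two gaps are commonly computed for binary predictions and a binary sensitive attribute follows; the function names and the toy data are illustrative and are not taken from the FairGC repository.

```python
from typing import List


def _positive_rate(y_pred: List[int], mask: List[bool]) -> float:
    """Share of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(y_pred, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def statistical_parity_gap(y_pred: List[int], sensitive: List[int]) -> float:
    """|P(pred=1 | s=0) - P(pred=1 | s=1)|."""
    r0 = _positive_rate(y_pred, [s == 0 for s in sensitive])
    r1 = _positive_rate(y_pred, [s == 1 for s in sensitive])
    return abs(r0 - r1)


def equal_opportunity_gap(y_true: List[int], y_pred: List[int],
                          sensitive: List[int]) -> float:
    """|P(pred=1 | y=1, s=0) - P(pred=1 | y=1, s=1)|."""
    r0 = _positive_rate(y_pred, [s == 0 and y == 1
                                 for s, y in zip(sensitive, y_true)])
    r1 = _positive_rate(y_pred, [s == 1 and y == 1
                                 for s, y in zip(sensitive, y_true)])
    return abs(r0 - r1)


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sens = [0, 0, 0, 0, 1, 1, 1, 1]

sp_gap = statistical_parity_gap(y_pred, sens)
eo_gap = equal_opportunity_gap(y_true, y_pred, sens)
print(sp_gap, eo_gap)
```

Both gaps are 0 for a perfectly fair classifier under the respective criterion; lower is better.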

285. ❌ Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data

作者: Yuanqiao Zhang, Tiantian He, Yuan Gao, Yixin Wang, Yew-Soon Ong, Maoguo Gong, A. K. Qin, Hui Li 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28316v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

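Scoring rationale: The paper focuses on a second-order optimization algorithm for federated learning (FedRCO), aimed at convergence speed and communication cost under non-IID data. All scoring keywords revolve around large models, deep learning principles, and their applications (such as AI for Science), whereas this paper studies optimization methods for federated learning, which belongs to distributed machine learning and has no direct connection to the large-model techniques, training methods, inference optimization, or AI applications covered by the keywords. Every keyword therefore receives a relevance of 0.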

!!! tip deepseek-chat TL;DR

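The paper proposes FedRCO, a robust second-order optimization framework that addresses the instability of federated learning under non-IID data and achieves faster convergence and higher accuracy.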

摘要翻译

本文提出联邦鲁棒曲率优化算法（Federated Robust Curvature Optimization, FedRCO），这是一种新颖的二阶优化框架，旨在提升统计异构环境下联邦学习系统的收敛速度并降低通信成本。现有的二阶优化方法在分布式环境中通常计算成本高昂且数值不稳定。相比之下，FedRCO通过将高效的近似曲率优化器与可证明的稳定性机制相结合，有效应对了这些挑战。具体而言，FedRCO包含三个核心组件：（1）梯度异常监测器，用于实时检测并抑制梯度爆炸；（2）故障安全恢复协议，在数值不稳定时重置优化状态；（3）曲率保持自适应聚合策略，能够在安全整合全局知识的同时保留局部曲率几何结构。理论分析表明，FedRCO在保持优化效率的同时，能有效缓解不稳定性并防止无界更新。大量实验证明，相较于当前最先进的一阶与二阶方法，FedRCO在多种非独立同分布场景下均表现出更优的鲁棒性，同时实现了更高的精度与更快的收敛速度。

摘要 (Abstract)

In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provable stability mechanism. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real-time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO can effectively mitigate instability and prevent unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO achieves superior robustness against diverse non-IID scenarios while achieving higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.

关键词: Federated Learning, Second-order Optimization, Non-IID Data, Robustness, Convergence Speed, Communication Cost, Statistical Heterogeneity, Numerical Stability
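
The abstract above lists a Gradient Anomaly Monitor and a fail-safe protocol among FedRCO's components without giving their details. The sketch below shows one plausible shape for such a guard: flag a gradient whose norm is non-finite or far above a running average, and skip the update when flagged. The detection rule, the thresholds, the class and function names, and the use of a plain gradient step in place of the paper's curvature-based update are all assumptions for illustration.

```python
import math
from typing import List, Optional


class GradientAnomalyMonitor:
    """Flags gradients whose norm is non-finite or far above a running mean."""

    def __init__(self, factor: float = 10.0, momentum: float = 0.9) -> None:
        self.factor = factor          # how many times the average counts as a spike
        self.momentum = momentum      # smoothing of the running average
        self.avg_norm: Optional[float] = None

    def is_anomalous(self, grad: List[float]) -> bool:
        norm = math.sqrt(sum(g * g for g in grad))
        if not math.isfinite(norm):
            return True
        if self.avg_norm is None:
            self.avg_norm = norm      # first observation initializes the average
            return False
        if norm > self.factor * self.avg_norm:
            return True               # keep the outlier out of the average
        self.avg_norm = self.momentum * self.avg_norm + (1 - self.momentum) * norm
        return False


def safe_step(params: List[float], grad: List[float], lr: float,
              monitor: GradientAnomalyMonitor) -> List[float]:
    """Plain gradient step, skipped entirely when the monitor raises a flag."""
    if monitor.is_anomalous(grad):
        return list(params)           # fail-safe: leave parameters untouched
    return [p - lr * g for p, g in zip(params, grad)]


monitor = GradientAnomalyMonitor()
p0 = [1.0, 1.0]
p1 = safe_step(p0, [0.1, 0.1], lr=0.5, monitor=monitor)           # normal update
p2 = safe_step(p1, [100.0, 100.0], lr=0.5, monitor=monitor)       # spike, skipped
p3 = safe_step(p2, [float("nan"), 0.0], lr=0.5, monitor=monitor)  # NaN, skipped
print(p1, p2, p3)
```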

286. ❌ LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

作者: Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung 期刊/来源: arxiv 发布日期: 2026-03-30 arXiv链接: http://arxiv.org/abs/2603.28301v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

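Scoring rationale: The paper studies how robust Vision-Language-Action (VLA) models are to instruction paraphrasing in robotic manipulation tasks, which makes it an application of large models to a specific domain (robotics). Relevance to the keywords: (1) highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (8 points), because the paper centers on fine-tuning VLA models for downstream robotic tasks and the generalization problems that follow; (2) somewhat relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points), because VLA models build on pre-trained vision-language backbones, although the paper does not examine pre-training itself in depth; (3) the remaining keywords (LLMs, MoE, Scaling Laws, and so on) have no direct connection to the paper's core content of VLA models, robotic tasks, and paraphrase robustness, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies the robustness of Vision-Language-Action (VLA) models to instruction paraphrasing in robotic manipulation tasks. It finds that fine-tuned models suffer a substantial performance drop (22-52 percentage points) under paraphrasing, driven mainly by object-level lexical variation, and it proposes a diagnostic benchmark and a quantitative metric for evaluating this robustness.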

摘要翻译

视觉-语言-动作（Vision-Language-Action, VLA）模型通过利用预训练的视觉-语言主干网络，在机器人操作任务中实现了强劲的性能。然而，在下游机器人应用场景中，这些模型通常仅使用有限数据进行微调，导致其过度适应特定的指令表述形式，而对改写指令的鲁棒性研究不足。为探究这一差距，我们引入了LIBERO-Para基准测试集，该数据集通过独立控制动作表达和对象指称的变化，以支持对语言泛化能力的细粒度分析。在七种不同规模（0.6B-7.5B参数）的VLA模型配置中，我们观察到在指令改写条件下模型性能出现22-52个百分点的持续下降。这种性能下降主要源于对象级词汇变化：即使简单的同义词替换也会导致性能大幅降低，表明模型依赖表层词汇匹配而非语义理解。此外，80-96%的失败案例源于规划层面的轨迹偏差而非执行错误，这说明指令改写干扰了模型对任务本身的识别。传统的二元成功率指标将所有改写指令等同对待，无法区分模型是在不同难度级别上表现一致，还是仅依赖于简单案例。为解决这一问题，我们提出了PRIDE评估指标，该指标通过语义和句法因素量化改写指令的难度。我们的基准测试集及相关代码已公开于：https://github.com/cau-hai-lab/LIBERO-Para

摘要 (Abstract)

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para

Keywords: Vision-Language-Action models, robotic manipulation, paraphrase robustness, fine-tuning, linguistic generalization, benchmark, semantic grounding, trajectory divergence

287. ❌ Learning from imperfect quantum data via unsupervised domain adaptation with classical shadows

Authors: Kosuke Ito, Akira Tanji, Hiroshi Yano, Yudai Suzuki, Naoki Yamamoto Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28294v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

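Scoring rationale: The paper studies unsupervised domain adaptation for learning from quantum data, representing quantum states with classical shadows. Relevance to the keyword list: (1) highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points), because the core of the paper is the application of domain adaptation to quantum data learning; (2) relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), as an application of AI to science (quantum physics), though not to bioinformatics or cheminformatics; (3) all other keywords are unrelated (0 points), since the paper does not cover large models, deep learning principles, training methods, inference optimization, agent systems, or similar topics.

!!! tip deepseek-chat TL;DR

To address imperfect target-domain data in quantum data learning, the paper proposes an unsupervised domain adaptation framework based on classical shadows. On quantum phases of matter and entanglement classification tasks, it outperforms source-only non-adaptive baselines and target-only unsupervised learning.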

摘要翻译

利用经典机器学习模型从量子数据中学习已成为实现量子优势的一种前景广阔的范式。尽管对其性能已有广泛分析，但在实际场景中，往往难以获得来自目标领域的、干净且完全标注的量子数据，这迫使模型必须在与部署环境存在差异的条件下收集的数据上进行训练。这种不匹配凸显了需要超越先前工作中常见假设的新方法。在本工作中，我们通过采用无监督域适应（unsupervised domain adaptation）框架来处理这一不完美量子数据的学习问题。具体而言，通过利用经由经典阴影（classical shadows）方法获得的量子态的经典表示，我们在完成对量子态的测量后，完全在经典计算流程中执行无监督域适应。我们在现实域偏移条件下，针对量子物态相分类和纠缠分类任务对该框架进行了数值评估。在这两项任务中，我们的方法均优于仅使用源域数据的非自适应基线方法以及仅使用目标域数据的无监督学习方法，证明了域适应技术在实际量子数据学习中的适用性。

摘要 (Abstract)

Learning from quantum data using classical machine learning models has emerged as a promising paradigm toward realizing quantum advantages. Despite extensive analyses on their performance, clean and fully labeled quantum data from the target domain are often unavailable in practical scenarios, forcing models to be trained on data collected under conditions that differ from those encountered at deployment. This mismatch highlights the need for new approaches beyond the common assumptions of prior work. In this work, we address this issue by employing an unsupervised domain adaptation framework for learning from imperfect quantum data. Specifically, by leveraging classical representations of quantum states obtained via classical shadows, we perform unsupervised domain adaptation entirely within a classical computational pipeline once measurements on the quantum states are executed. We numerically evaluate the framework on quantum phases of matter and entanglement classification tasks under realistic domain shifts. Across both tasks, our method outperforms source-only non-adaptive baselines and target-only unsupervised learning approaches, demonstrating the practical applicability of domain adaptation to realistic quantum data learning.

Keywords: quantum data learning, unsupervised domain adaptation, classical shadows, quantum phases of matter, entanglement classification, domain shift, classical machine learning, quantum advantages

288. ❌ OptINC: Optical In-Network-Computing for Scalable Distributed Learning

Authors: Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28290v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

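Scoring rationale: The paper proposes OptINC, an optical in-network-computing architecture for distributed learning that performs gradient averaging and quantization in the optical domain with optical devices such as MZIs, in order to reduce communication overhead. It is related to "Large Language Models" (5 points) because it is evaluated on a LLaMA-based network, and to "Quantization" (5 points) because it involves gradient quantization. The other keywords are unrelated to the paper's themes of optical hardware acceleration, distributed training architecture, and communication optimization, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an Optical In-Network-Computing (OptINC) architecture that reduces communication overhead in distributed learning by performing gradient averaging and quantization in the optical domain. On ResNet50 and a LLaMA-based network it reaches training accuracy comparable to the ring all-reduce baseline.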

摘要翻译

分布式学习通过将模型或数据集的部分分布到多个设备上，并聚合计算结果以进行后续计算或参数更新，被广泛用于在大型数据集上训练大规模模型。现有的分布式学习通信算法（如环形全归约）会导致服务器间产生沉重的通信开销。由于大规模系统中的通信使用光纤，我们提出了一种光学网络内计算（OptINC）架构，将服务器中的计算任务卸载到光学互连设备上。为了在光域中执行梯度平均化和量化，我们在互连设备中集成了马赫-曾德尔干涉仪（MZI）等光学器件。这种实质上的光学神经网络（ONN）能有效减少现有分布式训练方案中的通信开销。为降低训练该神经网络所需的数据集复杂度，我们还提出了一种在光域中实现的预处理算法。通过使用酉矩阵和对角矩阵近似光学神经网络的权重矩阵，硬件成本得以降低，同时通过提出的硬件感知训练算法保持了准确性。所提出的解决方案在真实的分布式学习任务上进行了评估，包括在CIFAR-100数据集上的ResNet50模型和在Wikipedia-1B数据集上的基于LLaMA的网络。在两种情况下，所提出的框架均能达到与环形全归约基准相当的训练精度，同时消除了通信开销。

摘要 (Abstract)

Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100, and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework can achieve comparable training accuracy to the ring all-reduce baseline, while eliminating communication overhead.

Keywords: Optical In-Network-Computing, distributed learning, gradient averaging, quantization, optical neural network, communication overhead, LLaMA, ResNet50
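
The abstract above says OptINC moves two operations into the optical interconnect: gradient averaging and quantization. As a rough software analogue only, the sketch below averages per-worker gradients and then applies symmetric uniform quantization. The function names, the bit width, and the quantization scheme are illustrative assumptions; the paper's optical implementation and its actual quantizer are not described in the abstract.

```python
import numpy as np

def average_gradients(worker_grads):
    """Element-wise mean of the gradients reported by each worker."""
    return np.mean(np.stack(worker_grads), axis=0)

def quantize_uniform(values, num_bits):
    """Symmetric uniform quantization (an assumed scheme, not the paper's).

    Values are mapped to integer codes in [-levels, levels] and back,
    so the rounding error is at most half a quantization step.
    """
    scale = np.max(np.abs(values))
    if scale == 0.0:
        return values.copy()
    levels = 2 ** (num_bits - 1) - 1      # e.g. 127 for 8 bits
    codes = np.round(values / scale * levels)
    return codes / levels * scale

# Two hypothetical workers, three parameters each.
worker_grads = [np.array([0.2, -0.4, 0.6]), np.array([0.4, 0.0, -0.2])]
avg = average_gradients(worker_grads)
quantized = quantize_uniform(avg, num_bits=4)
step = np.max(np.abs(avg)) / (2 ** (4 - 1) - 1)
```

In a ring all-reduce baseline the same average is produced by passing partial sums between servers; the paper's claim is that doing it inside the interconnect removes that traffic.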

289. ❌ Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

Authors: Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28281v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

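Scoring rationale: The paper focuses on robustness to data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF). It is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points), because it explicitly studies learning from human preference feedback, and to "Multi-agent Systems OR Agent Coordination" (10 points), because it studies Nash equilibria and coordination in multi-agent settings. The other keywords mainly concern large-model techniques, training methods, and inference optimization, which are unrelated to the paper's focus on reinforcement learning, game theory, and robustness theory, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies robustness to adversarial data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) and proposes algorithms that guarantee bounds on the Nash equilibrium gap or the coarse correlated equilibrium gap under different coverage assumptions.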

摘要翻译

我们在强污染模型下研究离线多智能体人类反馈强化学习（MARLHF）中对抗数据损坏的鲁棒性问题：给定一个轨迹-偏好元组的数据集 $D$（每个偏好是一个 $n$ 维二元标签向量，代表 $n$ 个智能体中每个智能体的偏好），其中 $ε$ 比例的样本可能被任意篡改。我们使用线性马尔可夫博弈框架对该问题进行建模。首先，在均匀覆盖假设下——即所有相关策略在干净（污染前）数据中均有充分体现——我们提出了一种鲁棒估计器，其能保证纳什均衡差距的界为 $O(ε^{1 - o(1)})$。接着，我们转向更具挑战性的单边覆盖设定，其中仅覆盖纳什均衡及其单智能体偏离策略。在此情况下，我们提出的算法实现了纳什差距 $O(\sqrt{ε})$ 的界。然而，这两种方法均面临计算不可行的难题。为解决此问题，我们将解的概念放宽至粗相关均衡（coarse correlated equilibria, CCE）。在相同的单边覆盖条件下，我们推导出一种拟多项式时间算法，其 CCE 差距的尺度为 $O(\sqrt{ε})$。据我们所知，这是对离线 MARLHF 中对抗性数据损坏问题的首次系统性研究。

摘要 (Abstract)

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $ε$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(ε^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{ε})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{ε})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.

Keywords: offline multi-agent reinforcement learning, human feedback, data corruption, robustness, Nash equilibrium, coarse correlated equilibria, linear Markov games, adversarial corruption

290. ❌ Nonlinear Factor Decomposition via Kolmogorov-Arnold Networks: A Spectral Approach to Asset Return Analysis

Authors: David Breazu Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	1.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	1.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	1.0	0.0/10	0.0
Scaling Laws AND Data Quality	1.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	1.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	1.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	1.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	1.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	1.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	1.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	1.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	1.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	1.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	1.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	1.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	1.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	1.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	1.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	1.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	1.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	1.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	1.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	1.0	0.0/10	0.0
World Models AND General World Models	1.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	1.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	1.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	1.0	0.0/10	0.0

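Scoring rationale: The paper proposes KAN-PCA, a method for financial time-series analysis that uses a Kolmogorov-Arnold Network (KAN) as the encoder and a linear map as the decoder to improve on classical PCA under nonlinear market conditions. Although a KAN is a neural network architecture, the paper is concerned entirely with financial data analysis (asset return decomposition) and does not involve large language models, new deep learning principles, or applications of large models to other domains. Every scoring keyword relates directly to large models, deep learning techniques, or AI for science, whereas this paper is a pure financial engineering / econometrics application, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes KAN-PCA, a nonlinear factor decomposition method based on Kolmogorov-Arnold Networks (KAN) for asset return analysis. It is motivated by market crises, where the linear assumption behind PCA breaks down. With the same 3 factors it reaches a higher reconstruction R² than classical PCA (66.57% vs 62.99%), and it matches PCA out of sample once data leakage in training is corrected.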

摘要翻译

KAN-PCA是一种使用KAN作为编码器、线性映射作为解码器的自编码器。它通过将线性投影替换为每条边上可学习的B样条函数，推广了经典的主成分分析（PCA）。其动机在于捕捉比经典PCA更多的方差，因为当市场危机期间线性假设失效、资产间相关性剧烈变化时，经典PCA会变得低效。我们证明，若强制样条激活函数为线性，KAN-PCA将得到与经典PCA完全相同的结果，从而确立PCA为其特例。在20只标普500股票（2015-2024年）上的实验表明，在使用相同3个因子的情况下，KAN-PCA实现了66.57%的重建R^2，而经典PCA为62.99%；同时在训练过程中修正数据泄露问题后，其样本外表现与PCA相当。

摘要 (Abstract)

KAN-PCA is an autoencoder that uses a KAN as encoder and a linear map as decoder. It generalizes classical PCA by replacing linear projections with learned B-spline functions on each edge. The motivation is to capture more variance than classical PCA, which becomes inefficient during market crises when the linear assumption breaks down and correlations between assets change dramatically. We prove that if the spline activations are forced to be linear, KAN-PCA yields exactly the same results as classical PCA, establishing PCA as a special case. Experiments on 20 S&P 500 stocks (2015-2024) show that KAN-PCA achieves a reconstruction R^2 of 66.57%, compared to 62.99% for classical PCA with the same 3 factors, while matching PCA out-of-sample after correcting for data leakage in the training procedure.

Keywords: Kolmogorov-Arnold Networks, KAN-PCA, Nonlinear Factor Decomposition, Asset Return Analysis, Principal Component Analysis, B-spline functions, Financial Time Series, Market Crises
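
The abstract compares KAN-PCA against classical PCA using reconstruction R^2 with 3 factors. The sketch below shows how that baseline number could be computed for classical PCA only. The data is synthetic (a made-up 3-factor model for 20 assets), and `pca_reconstruction_r2` is a hypothetical helper name; it does not reproduce the paper's data, its KAN encoder, or its reported figures.

```python
import numpy as np

def pca_reconstruction_r2(returns, n_factors):
    """R^2 of the rank-k PCA reconstruction of a (T x N) return matrix."""
    centered = returns - returns.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_factors]             # (k x N) linear decoder
    factors = centered @ loadings.T       # (T x k) linear encoder output
    reconstructed = factors @ loadings    # (T x N) rank-k reconstruction
    residual = np.sum((centered - reconstructed) ** 2)
    total = np.sum(centered ** 2)
    return 1.0 - residual / total

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 assets driven by 3 latent factors plus noise.
latent = rng.normal(size=(500, 3))
exposures = rng.normal(size=(3, 20))
returns = latent @ exposures + 0.5 * rng.normal(size=(500, 20))

r2_three = pca_reconstruction_r2(returns, n_factors=3)
print(f"classical PCA, 3 factors: R^2 = {r2_three:.4f}")
```

In the paper's framing, KAN-PCA replaces the linear encoder step (`centered @ loadings.T`) with learned B-spline functions while keeping a linear decoder, which is why forcing the splines to be linear recovers this computation.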

291. ❌ MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Authors: Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

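Scoring rationale: The paper proposes MuonEq, a lightweight pre-orthogonalization equilibration scheme for the Muon optimizer, aimed at improving the training of matrix-valued parameters. The work is validated on LLaMA2 pre-training, so it is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (weight 1.0; 8 points), because it deals directly with the pre-training process. Because the experiments use LLaMA2 models, it is also somewhat relevant to "Large Language Models OR LLMs OR Foundation Models" (weight 1.0; 5 points). The other keywords (MoE, SFT, RLHF, RAG, and so on) are unrelated to the paper's core content (optimizer design) and score 0. The weighted total is (5 × 1.0) + (8 × 1.0) = 13.0.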
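
The arithmetic in this rationale can be checked with a short sketch. It assumes the per-keyword score is weight × relevance, as the rationale's own sum implies, and uses the pass-mark formula from the report header (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names are hypothetical and the report generator's actual code is not shown in this document. The sketch also evaluates the 0.0 weights listed in the score table, which is one possible reading of why the displayed score is 0.0 rather than 13.0.

```python
# Hypothetical helpers; the weight * relevance rule is inferred from the
# rationale text, not taken from the report generator's source.
PASS_BASE = 5.0
PASS_PER_WEIGHT = 0.8

def weighted_score(relevance, weights):
    """Sum of weight * relevance over the scored keywords."""
    return sum(weights[name] * value for name, value in relevance.items())

def passing_score(total_weight):
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return PASS_BASE + PASS_PER_WEIGHT * total_weight

relevance = {"Large Language Models": 5.0, "Pre-training": 8.0}
weights_in_rationale = {"Large Language Models": 1.0, "Pre-training": 1.0}
weights_in_table = {"Large Language Models": 0.0, "Pre-training": 0.0}

score_rationale = weighted_score(relevance, weights_in_rationale)
score_table = weighted_score(relevance, weights_in_table)
threshold = passing_score(27.0)

print(score_rationale, score_table, round(threshold, 1))
```

Under either set of weights the paper stays below the pass mark.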

!!! tip deepseek-chat TL;DR

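The paper proposes MuonEq, a lightweight pre-orthogonalization equilibration scheme that improves the Muon optimizer when training matrix-valued parameters. In LLaMA2 pre-training it yields faster convergence and lower validation perplexity.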

摘要翻译

诸如Muon等正交化更新优化器改进了矩阵值参数的训练，但现有扩展大多通过重新缩放更新在正交化后发挥作用，或采用更复杂的基于白化的预条件器在其前发挥作用。我们提出了MuonEq，这是一个轻量级的预正交化均衡方案家族，包含三种形式：双边行/列归一化（RC）、行归一化（R）和列归一化（C）。这些变体在使用行/列平方范数统计量进行有限步Newton–Schulz正交化前对动量矩阵进行再平衡，仅需$\mathcal{O}(m+n)$的辅助状态。我们证明有限步正交化受输入谱特性（特别是稳定秩和条件数）调控，而行/列归一化是一种零阶白化替代方法，可消除边缘尺度失配。对于MuonEq所针对的隐藏矩阵权重，行归一化变体R是自然的默认选择，并保持了Muon类方法的$\widetilde{\mathcal{O}}(T^{-1/4})$平稳性保证。在C4数据集上的LLaMA2预训练中，默认的R变体在1.3亿和3.5亿参数模型上持续优于Muon，实现了更快的收敛速度和更低的验证困惑度。

摘要 (Abstract)

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton–Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.

Keywords: MuonEq, orthogonalization, optimizer, pre-training, LLaMA2, matrix-valued parameters, convergence, validation perplexity
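
The abstract's central claim is that a finite number of Newton–Schulz steps behaves better when the input matrix is well conditioned, and that row normalization removes scale mismatch between rows. The sketch below illustrates that idea on a toy 3×3 matrix. It makes several assumptions: it uses the textbook cubic Newton–Schulz iteration rather than whatever polynomial Muon or the paper uses, it normalizes each row to unit norm as a stand-in for the R variant, and it works on a hand-built matrix rather than a real momentum buffer.

```python
import numpy as np

def row_normalize(momentum, eps=1e-12):
    """Rescale each row to unit norm using row squared-norm statistics."""
    row_sq_norms = np.sum(momentum ** 2, axis=1, keepdims=True)
    return momentum / np.sqrt(row_sq_norms + eps)

def newton_schulz(matrix, steps):
    """Textbook cubic Newton-Schulz iteration toward the orthogonal polar factor."""
    x = matrix / np.linalg.norm(matrix)   # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def orth_error(x):
    """Frobenius distance of x @ x.T from the identity."""
    return np.linalg.norm(x @ x.T - np.eye(x.shape[0]))

# Toy "momentum" whose rows differ in scale by four orders of magnitude.
base = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
momentum = np.diag([100.0, 1.0, 0.01]) @ base
balanced = row_normalize(momentum)

steps = 5
raw_out = newton_schulz(momentum, steps)
eq_out = newton_schulz(balanced, steps)

print("condition number:", np.linalg.cond(momentum), "->", np.linalg.cond(balanced))
print("orthogonality error after", steps, "steps:", orth_error(raw_out), "->", orth_error(eq_out))
```

The row statistics cost one number per row, which is in line with the $\mathcal{O}(m+n)$ auxiliary state mentioned in the abstract when column statistics are included as well.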

292. ❌ Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring

Authors: Rahul Jaiswal, Joakim Hellum, Halvor Heiberg Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28225v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

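Scoring rationale: The paper focuses on bridge anomaly detection with classical machine learning methods (such as DBSCAN), an application of AI to infrastructure monitoring. All keywords relate to large models, deep learning principles, or related advanced AI methods (LLMs, MoE, RLHF, RAG, and so on), none of which the paper touches; it mentions only "AI-driven" and "machine learning" without specifying large models or deep learning. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since bridge monitoring can be seen as AI applied to engineering science, but the paper has no explicit link to bioinformatics or cheminformatics, so it receives 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a DBSCAN-based, AI-driven anomaly detection method for smart bridge monitoring. Experiments indicate that it detects bridge accidents more accurately than the other ML models compared, which can improve public safety.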

摘要翻译

桥梁是国家基础设施与智慧城市的关键组成部分。因此，智能桥梁监测对于保障公共安全、防止灾难性结构失效或事故至关重要。传统的桥梁监测方法主要依赖人工目视检查，这种方式耗时且易受主观性和误差影响。本文提出了一种基于人工智能（AI）的异常检测方法，用于智能桥梁监测。具体而言，我们利用安装在挪威一座桥梁上的iBridge传感器设备所采集的实时数据，开发了一个简单的机器学习（ML）模型。所提出的模型与多种不同的机器学习模型进行了对比评估。实验结果表明，基于噪声应用密度空间聚类（DBSCAN）的模型在准确检测异常事件（桥梁事故）方面表现更优。这些发现表明，该模型非常适用于智能桥梁监测，并能通过及时检测突发事故来提升公共安全。

摘要 (Abstract)

Bridges are critical components of national infrastructure and smart cities. Therefore, smart bridge monitoring is essential for ensuring public safety and preventing catastrophic failures or accidents. Traditional bridge monitoring methods rely heavily on human visual inspections, which are time-consuming and prone to subjectivity and error. This paper proposes an artificial intelligence (AI)-driven anomaly detection approach for smart bridge monitoring. Specifically, a simple machine learning (ML) model is developed using real-time sensor data collected by the iBridge sensor devices installed on a bridge in Norway. The proposed model is evaluated against different ML models. Experimental results demonstrate that the density-based spatial clustering of applications with noise (DBSCAN)-based model outperforms in accurately detecting the anomalous events (bridge accident). These findings indicate that the proposed model is well-suited for smart bridge monitoring and can enhance public safety by enabling the timely detection of unforeseen incidents.

Keywords: anomaly detection, smart bridge monitoring, DBSCAN, machine learning, sensor data, public safety, infrastructure
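
The abstract reports that a DBSCAN-based model works best for flagging anomalous events. The sketch below shows how DBSCAN can act as an anomaly detector: points that belong to no dense cluster are labelled as noise (-1) and treated as anomalies. It is a minimal pure-Python implementation on scalar readings. The readings, `eps`, and `min_samples` are invented for illustration and have nothing to do with the iBridge sensor data or the authors' settings.

```python
def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN on scalar readings; returns one label per point, -1 for noise."""
    n = len(values)
    # Each point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1            # provisional noise; may become a border point
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])   # j is also a core point: keep expanding
    return labels

# Invented sensor readings with one obvious outlier at index 5.
readings = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0, 1.02]
labels = dbscan_1d(readings, eps=0.25, min_samples=3)
anomalies = [i for i, label in enumerate(labels) if label == -1]
```

Unlike a fixed threshold, this approach needs no labelled accidents: an event is anomalous simply because too few other readings lie near it.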

293. ❌ Variational Neurons in Transformers for Language Modeling

Authors: Yves Ruffenach Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

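Scoring rationale: The core of the paper is a modification of the Transformer architecture that introduces variational neurons to strengthen internal uncertainty modeling, which counts as an innovation in the technical principles of large models. It is highly relevant to "Large Language Models" (8 points) because it studies Transformer language models, and to "Mechanistic Interpretability" (8 points) because it analyzes internal model behavior, uncertainty signals, and calibration, which falls under interpretability research. Other keywords such as MoE, quantization, inference acceleration, and alignment are not covered, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to introduce variational neurons into Transformer language models to strengthen internal uncertainty modeling. Experiments show that the approach integrates stably, preserves predictive performance, and produces informative uncertainty signals.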

摘要翻译

用于语言建模的Transformer通常依赖于确定性的内部计算，其不确定性主要体现在输出层。我们在Transformer的前馈计算中引入变分神经元，使不确定性成为内部计算本身的一部分。具体而言，我们在保持Transformer整体架构不变的前提下，将确定性的前馈单元替换为基于EVE的局部变分单元。
我们在紧凑的下一个词预测语言建模场景中评估了这一设计。我们通过预测性和概率性两种标准，比较了确定性模型与变分模型。除了负对数似然、困惑度和准确率外，我们还分析了校准度、条件方差、互信息以及潜在使用统计量。结果呈现出清晰的结论：变分神经元能够稳定地集成到Transformer中，在保持强大预测性能的同时，产生信息丰富的不确定性信号。实验还表明，任务质量、有效深度和内部稳定性是彼此独立的属性。
这些结果确立了变分Transformer作为一种实用的不确定性感知语言建模形式。研究表明，Transformer能够通过显式的内部不确定性结构进行预测，这支持了更强大的概率评估以及对模型行为更具信息量的分析。

摘要 (Abstract)

Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We introduce variational neurons into Transformer feed-forward computation so that uncertainty becomes part of the internal computation itself. Concretely, we replace deterministic feed-forward units with local variational units based on EVE while preserving the overall Transformer backbone. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity and accuracy, we analyze calibration, conditional variance, mutual information and latent-usage statistics. The resulting picture is clear. Variational neurons integrate stably into Transformers, preserve strong predictive performance and produce informative uncertainty signals. The experiments also show that task quality, useful depth and internal stability are distinct properties. These results establish variational Transformers as a practical form of uncertainty-aware language modeling. They show that Transformers can predict with an explicit internal structure of uncertainty, which supports stronger probabilistic evaluation and a more informative analysis of model behavior.

Keywords: Variational Neurons, Transformer, Language Modeling, Uncertainty, Probabilistic Evaluation, Calibration, Internal Computation, Feed-forward Units
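The abstract describes replacing deterministic feed-forward units with local variational units. As a rough illustration of what a stochastic unit can look like, the sketch below uses a generic Gaussian unit with the reparameterization trick and a KL term against a standard normal prior. This is an assumption for illustration only: the abstract does not specify how EVE units are defined, and the names `variational_unit`, `w_mu`, and `w_logvar` are hypothetical.

```python
import math
import random

def variational_unit(x, w_mu, w_logvar, rng):
    """Generic Gaussian stochastic unit (not the paper's EVE unit).

    Returns (sample, mean, variance, kl), where kl is the KL divergence
    from N(mean, variance) to a standard normal prior N(0, 1).
    """
    mu = sum(w * xi for w, xi in zip(w_mu, x))
    logvar = sum(w * xi for w, xi in zip(w_logvar, x))
    var = math.exp(logvar)
    # Reparameterization: sample = mean + std * standard normal noise.
    z = mu + math.sqrt(var) * rng.gauss(0.0, 1.0)
    kl = 0.5 * (var + mu * mu - 1.0 - logvar)
    return z, mu, var, kl

rng = random.Random(0)
x = [1.0, -2.0]
w_mu = [0.5, 0.25]        # hypothetical weights for the mean
w_logvar = [-1.0, 0.5]    # hypothetical weights for the log-variance

# Repeated forward passes give an empirical view of the unit's
# conditional mean and conditional variance for a fixed input.
samples = [variational_unit(x, w_mu, w_logvar, rng)[0] for _ in range(20000)]
_, mu, var, kl = variational_unit(x, w_mu, w_logvar, rng)
empirical_mean = sum(samples) / len(samples)
empirical_var = sum((s - empirical_mean) ** 2 for s in samples) / len(samples)
print(mu, var, kl, empirical_mean, empirical_var)
```

The per-unit variance is one way an internal uncertainty signal such as the conditional variance mentioned in the abstract could be read off; whether the paper computes it this way is not stated.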

294. ❌ A Perturbation Approach to Unconstrained Linear Bandits

Authors: Andrew Jacobsen, Dorian Baudry, Shinji Ito, Nicolò Cesa-Bianchi Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

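Scoring rationale: The paper studies the unconstrained linear bandit problem, which belongs to classical online learning theory. Its main contributions are an improved perturbation method, a dynamic-regret analysis, high-probability guarantees, and a lower-bound proof. The content does not involve large models, deep learning, AI for Science, or any technique among the scoring keywords, so it is unrelated to all of them.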

!!! tip deepseek-chat TL;DR

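The paper improves the perturbation approach for unconstrained linear bandits, reducing the problem to standard online linear optimization, and obtains new theoretical guarantees for both static and dynamic regret.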

Abstract translation (Chinese)

我们重新审视了Abernethy等人（2008）在无约束老虎机线性优化（uBLO）背景下提出的标准基于扰动的分析方法。我们揭示了一个令人意外的结果：在无约束设定下，该方法能够有效地将老虎机线性优化（BLO）问题转化为标准的在线线性优化（OLO）问题。我们的框架在多个方面改进了先前的研究。首先，当我们的扰动方案与比较器自适应的OLO算法结合时，我们推导出了期望遗憾的保证，这为不同对抗模型对最终比较器自适应速率的影响提供了新的见解。我们还将分析扩展到动态遗憾，在无需预先知道路径长度$P_T$的情况下，获得了最优的$\sqrt{P_T}$路径长度依赖关系。随后，我们首次为uBLO中的静态遗憾和动态遗憾建立了高概率保证。最后，我们讨论了静态遗憾的下界，并独立证明了单位欧几里得球上对抗性线性老虎机的经典$Ω(\sqrt{dT})$下界，这一结果本身具有独立的研究意义。

摘要 (Abstract)

We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $\Omega(\sqrt{dT})$ rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.

Keywords: unconstrained linear bandits, perturbation approach, online linear optimization, dynamic regret, high-probability guarantees, regret bounds, adversarial linear bandits
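The reduction from bandit feedback to online linear optimization rests on estimating the full loss vector from a single scalar observation. The sketch below shows a generic one-point estimator with perturbations drawn uniformly from the signed coordinate directions, and checks by exact enumeration that it is unbiased. It is an illustrative assumption, not the perturbation scheme of the paper or of Abernethy et al. (2008); the function names and the numeric values are hypothetical.

```python
def one_point_estimate(g, w, u, eps):
    """Loss-vector estimate from a single scalar bandit observation.

    The learner plays x = w + eps * u and only observes the scalar <g, x>.
    """
    d = len(g)
    x = [wi + eps * ui for wi, ui in zip(w, u)]
    feedback = sum(gi * xi for gi, xi in zip(g, x))   # the only thing observed
    return [(d / eps) * feedback * ui for ui in u]

def signed_basis(d):
    """All 2d perturbation directions +e_i and -e_i."""
    dirs = []
    for i in range(d):
        for s in (1.0, -1.0):
            dirs.append([s if j == i else 0.0 for j in range(d)])
    return dirs

g = [0.5, -1.0, 2.0]      # hidden loss vector (hypothetical values)
w = [3.0, 1.0, -2.0]      # point proposed by the OLO algorithm
dirs = signed_basis(len(g))

# Exact expectation under the uniform distribution on the 2d directions.
estimates = [one_point_estimate(g, w, u, eps=0.1) for u in dirs]
mean = [sum(e[j] for e in estimates) / len(dirs) for j in range(len(g))]
print(mean)
```

Because the perturbation is symmetric, the term involving `w` cancels in expectation, so the averaged estimate recovers `g`. An OLO algorithm can then be run on the estimates in place of the true loss vectors.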

295. ❌ A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents

Authors: Takato Shibayama, Hiroaki Kawashima Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28200v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

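Scoring rationale: The paper uses deep reinforcement learning (the PPO algorithm) to guide fish-school motion through virtual agents, an application of AI to biology and animal behavior. It does not touch on large language models (LLMs), model architecture, training methods, inference optimization, alignment techniques, or agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to science (specifically biology), but it is not core bioinformatics or cheminformatics, hence 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.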

!!! tip deepseek-chat TL;DR

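The paper proposes a deep reinforcement learning (PPO) framework that uses virtual agents for closed-loop guidance of fish schools; experiments show that the system guides groups of 5 fish effectively, but guidance degrades markedly when the group size increases to 8.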

Abstract translation (Chinese)

引导生物群体的集体运动是理解社会互动规则和发展自动化动物管理系统的核心挑战。本研究提出了一种基于深度强化学习（RL）的框架，利用虚拟智能体对鱼群进行闭环引导。这些智能体通过近端策略优化（PPO）算法在仿真环境中训练策略，并部署于红鼻剪刀鱼（Petitella bleheri）的物理实验中，实现了人工智能体与活体个体之间的实时交互。为应对活体个体的随机行为，我们设计了一种复合奖励函数，以平衡方向性引导与社会凝聚力。我们对视觉参数的系统评估表明，在物理实验中，白色背景和较大的刺激尺寸能最大化引导效能。此外，对不同群体规模的评估显示，虽然该系统能有效引导由五个个体组成的鱼群，但当群体规模增加到八个个体时，其引导能力显著下降。本研究凸显了深度强化学习在生物群体自动化引导方面的潜力，并指出了在更大群体中维持人工影响力所面临的挑战。

摘要 (Abstract)

Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.

Keywords: deep reinforcement learning, closed-loop guidance, fish schools, virtual agents, Proximal Policy Optimization (PPO), collective motion, biological groups, real-time interaction

Authors: Hongkai Hu Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

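Scoring rationale: The paper studies dynamic-regret optimization in strictly online prediction and proposes the Policy-Controlled Generalized Share framework and its Transformer instantiation, PCGS-TF. Although a Transformer serves as the update controller, the paper is concerned with traditional machine learning problems such as online learning, expert switching, and dynamic regret, rather than the technical principles, training methods, applications, or innovations of large models. All keywords concern large models, deep learning techniques, or AI-for-science applications, to which the paper's core content has no direct connection, so every keyword receives a relevance of 0.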

!!! tip deepseek-chat TL;DR

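For strictly online prediction in non-stationary environments, the paper proposes the Policy-Controlled Generalized Share framework and its Transformer instantiation, PCGS-TF, which optimizes dynamic regret through adaptive update controls and attains the lowest dynamic regret on both synthetic data and a real-world benchmark.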

Abstract translation (Chinese)

在非平稳性下的严格在线预测中，针对单一专家的静态遗憾通常并非合适的目标，因为最佳专家可能随时间反复切换。我们研究了策略控制广义共享（PCGS），这是一个通用的严格在线框架，其中广义共享递归是固定的，而允许损失后更新控制自适应变化。本文中的主要实例是PCGS-TF，它使用因果Transformer作为更新控制器：在第t轮结束并观察到损失向量后，Transformer输出将w_t映射至w_{t+1}的控制参数，而不改变已确定的决策w_t。在允许的损失后更新控制下，我们为一般时变学习率获得了路径加权遗憾保证，并在恒定学习率特化下，针对最多S次切换的任意专家路径获得了标准动态遗憾保证。实证方面，在一个具有精确动态规划切换预言机评估的受控合成测试集中，PCGS-TF在所有七个非平稳族中均取得了最低的平均动态遗憾，且其优势随着专家池规模增大而增加。在一个复现的家庭用电基准测试中，PCGS-TF在S = 5、10和20时也实现了最低的归一化动态遗憾。

摘要 (Abstract)

Static regret to a single expert is often the wrong target for strictly online prediction under non-stationarity, where the best expert may switch repeatedly over time. We study Policy-Controlled Generalized Share (PCGS), a general strictly online framework in which the generalized-share recursion is fixed while the post-loss update controls are allowed to vary adaptively. Its principal instantiation in this paper is PCGS-TF, which uses a causal Transformer as an update controller: after round t finishes and the loss vector is observed, the Transformer outputs the controls that map w_t to w_{t+1} without altering the already committed decision w_t. Under admissible post-loss update controls, we obtain a pathwise weighted regret guarantee for general time-varying learning rates, and a standard dynamic-regret guarantee against any expert path with at most S switches under the constant-learning-rate specialization. Empirically, on a controlled synthetic suite with exact dynamic-programming switching-oracle evaluation, PCGS-TF attains the lowest mean dynamic regret in all seven non-stationary families, with its advantage increasing for larger expert pools. On a reproduced household-electricity benchmark, PCGS-TF also achieves the lowest normalized dynamic regret for S = 5, 10, and 20.

Keywords: online prediction, dynamic regret, expert switching, Transformer, non-stationary, policy-controlled, generalized share, strictly online
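The abstract describes a fixed share-style recursion whose post-loss controls are chosen adaptively after each round. As a rough illustration, the sketch below implements the classical Fixed Share update (an exponential-weights step followed by mixing with the uniform distribution), with the per-round learning rate `eta` and share rate `alpha` passed in as the controls. The function name `share_update`, the uniform mixing, and the constant controls are assumptions; the paper's generalized-share recursion and its Transformer controller may differ.

```python
import math

def share_update(w, losses, eta, alpha):
    """One post-loss update mapping w_t to w_{t+1}.

    w      : weights over experts (sum to 1), already committed for round t
    losses : loss vector observed after round t
    eta    : learning rate for this round (a control)
    alpha  : share rate in [0, 1] for this round (a control)
    """
    # Exponential-weights step: downweight experts with high loss.
    v = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    z = sum(v)
    v = [vi / z for vi in v]
    # Share step: mix with the uniform distribution so no expert's weight
    # collapses, which allows fast recovery after the best expert switches.
    n = len(w)
    return [(1.0 - alpha) * vi + alpha / n for vi in v]

# Two experts: expert 0 is best for 5 rounds, then expert 1 takes over.
w = [0.5, 0.5]
for t in range(10):
    losses = [0.0, 1.0] if t < 5 else [1.0, 0.0]
    # A learned controller would output (eta, alpha) here after seeing the
    # losses; constants are used purely for illustration.
    w = share_update(w, losses, eta=2.0, alpha=0.05)
print(w)
```

Note that the update only produces the next round's weights; the decision already played in round `t` is never revised, which matches the strictly online constraint described in the abstract.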

297. ❌ Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

Authors: Ane G Domingo-Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea, Koldo Gojenola, Aitziber Atutxa Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

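Scoring rationale: The paper focuses on using natural language processing to process clinical text (discharge reports) in order to improve early prediction of atrial fibrillation, an application of AI in biomedicine. It does not address core large-model topics such as technical principles, training methods, inference optimization, alignment techniques, or agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper falls under bioinformatics and medical AI applications, although this is not its core technical focus (the emphasis is on the NLP data-processing pipeline rather than the AI model itself).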

!!! tip deepseek-chat TL;DR

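The study proposes an automated method that uses natural language processing to extract information from unstructured discharge reports to improve early prediction of atrial fibrillation; experiments show that it improves the accuracy and reliability of predictive models.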

Abstract translation (Chinese)

本研究提出一种利用非结构化出院报告提取信息的临床早期预测全自动化方法。该流程通过出院报告支持早期预测的三个核心步骤：队列筛选、数据集构建与结局标注。借助自然语言处理技术对出院报告进行解析，我们能够高效识别相关患者队列，利用附加临床变量增强结构化数据集，并无需人工干预即可生成高质量标签。该方法解决了编码化电子健康记录中常见的数据缺失或不完整问题，能捕捉常被低估的临床相关信息。我们以心房颤动进展预测为场景对该方法进行评估，结果表明：相较于仅基于结构化电子健康记录数据训练的模型，利用出院报告信息增强数据集训练的预测模型不仅具有更高的准确度、与真实结局更强的相关性，且性能超越传统临床评分体系。这些结果证明，通过自动化整合非结构化临床文本，能够简化早期预测研究流程，提升数据质量，并增强临床决策支持预测模型的可靠性。

摘要 (Abstract)

This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes compared to models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.

Keywords: early disease prediction, clinical data, unstructured discharge reports, natural language processing, atrial fibrillation, predictive models, electronic health records, data quality

298. ❌ ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment

Authors: Tran Duong Minh Dai, Triet Huynh Minh Le, M. Ali Babar, Van-Hau Pham, Phan The Duy Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

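Scoring rationale: The paper proposes the ORACAL framework, whose core innovation is applying RAG and LLMs to smart contract vulnerability detection: RAG enriches the security context and LLMs supply semantic understanding. It is therefore highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10 points). The paper uses LLMs as a component, so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The framework adopts PGExplainer for explainability, which relates to 'Mechanistic Interpretability OR Explainable AI' (8 points). The application domain is smart contract security, an application of AI to a specific science/engineering field, which gives some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). Other keywords such as MoE, Scaling Laws, and RLHF are not covered in the paper and score 0.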

!!! tip deepseek-chat TL;DR

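To address the lack of semantic understanding and explainability in graph neural network models for smart contract vulnerability detection, the paper proposes the ORACAL framework, which integrates RAG and LLMs to enrich security context and adopts causal attention and an explainability method, achieving state-of-the-art detection performance (peak Macro F1 of 91.28%) and strong robustness across multiple benchmarks.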

Abstract translation (Chinese)

尽管图神经网络（GNNs）在智能合约漏洞检测中展现出潜力，但仍面临显著局限。同构图模型难以捕捉控制流与数据依赖之间的相互作用，而异构图方法通常缺乏深层语义理解，使其易受对抗性攻击影响。此外，大多数黑盒模型无法提供可解释的证据，阻碍了专业审计中的可信度。为应对这些挑战，我们提出ORACAL（基于因果推理的可观察检索增强分析框架），这是一个融合控制流图（CFG）、数据流图（DFG）和调用图（CG）的异质多模态图学习框架。ORACAL通过检索增强生成（RAG）和大语言模型（LLMs）选择性注入专家级安全上下文以丰富关键子图，并采用因果注意力机制分离真实漏洞指标与伪相关性。为提升透明度，该框架采用PGExplainer生成子图层级解释，以定位漏洞触发路径。在大规模数据集上的实验表明，ORACAL实现了最先进的性能，在主要基准测试中以最高91.28%的宏观F1值，超越MANDO-HGT、MTVHunter、GNN-SC和SCVHunter达39.6个百分点。在分布外数据集上，ORACAL展现出强大泛化能力，在CGT Weakness和DAppScan上分别达到91.8%和77.1%的准确率。在可解释性评估中，PGExplainer针对人工标注的漏洞触发路径实现了32.51%的平均交并比（MIoU）。在对抗攻击下，ORACAL将性能下降限制在约2.35%的F1值降低，攻击成功率（ASR）仅为3%，显著优于ASR介于10.91%至18.73%的SCVHunter和MANDO-HGT。

摘要 (Abstract)

Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.

Keywords: smart contract vulnerability detection, Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), heterogeneous multimodal graph, causal attention, explainable AI, adversarial robustness, graph neural networks

299. ❌ Graph Vector Field: A Unified Framework for Multimodal Health Risk Assessment from Heterogeneous Wearable and Environmental Data Streams

Authors: Silvano Coletti, Francesca Fallucchi Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

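Scoring rationale: The paper proposes the Graph Vector Field (GVF) framework, whose core innovation is combining a modality-structured Mixture of Experts with discrete differential-geometric operators for multimodal health risk assessment, so it is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points). The paper is digital health research that applies AI in the bioinformatics domain, making it highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It emphasizes interpretable risk modelling, which gives some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points). The paper does not involve large language models, training methods, inference optimization, agents, or the other keywords, so those keywords score 0.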

!!! tip deepseek-chat TL;DR

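The paper proposes the Graph Vector Field (GVF) framework, which combines a modality-structured mixture-of-experts with discrete differential-geometric operators to model health risk from multimodal wearable and environmental data streams, targeting interpretable, modality-resolved risk modelling; empirical validation is left to ongoing work.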

Abstract translation (Chinese)

数字健康研究已发展出基于动态图的疾病模型、单纯复形上的拓扑学习以及多模态专家混合架构，但这些研究方向在很大程度上仍处于割裂状态。我们提出图向量场（Graph Vector Field, GVF）框架，该框架将健康风险建模为时变单纯复形上的向量值场，将离散微分几何算子与模态结构化的专家混合模型相耦合。风险被表示为向量值上链，其演化过程由霍奇拉普拉斯算子及离散外微积分算子参数化，通过亥姆霍兹-霍奇分解为势驱动（恰当）、类循环（余恰当）和拓扑约束（调和）三个分量，分别对应可解释的传播性、周期性和持续性风险机制。来自可穿戴传感器、行为/环境背景及临床/基因组数据的多模态输入，通过具有丛结构的专家混合模型进行整合——该模型将各模态特定的隐空间作为纤维附着于基复形之上。这种设计分离了模态特异性贡献与共享贡献，并为实现模态级可辨识性提供了理论路径。GVF将几何动力系统、高阶拓扑（通过几何正则化与霍奇分解间接实现）以及结构化多模态融合集成于统一框架中，实现了可解释且模态可解析的风险建模。本文阐述了其数学基础、架构设计与形式化保证；实证验证是当前正在进行的研究工作。

摘要 (Abstract)

Digital health research has advanced dynamic graph-based disease models, topological learning on simplicial complexes, and multimodal mixture-of-experts architectures, but these strands remain largely disconnected. We propose Graph Vector Field (GVF), a framework that models health risk as a vector-valued field on time-varying simplicial complexes, coupling discrete differential-geometric operators with modality-structured mixture-of-experts. Risk is represented as a vector-valued cochain whose evolution is parameterised with Hodge Laplacians and discrete exterior calculus operators, yielding a Helmholtz-Hodge decomposition into potential-driven (exact), circulation-like (coexact), and topologically constrained (harmonic) components linked to interpretable propagation, cyclic, and persistent risk mechanisms. Multimodal inputs from wearable sensors, behavioural/environmental context, and clinical/genomic data are incorporated through a bundle-structured mixture-of-experts in which modality-specific latent spaces are attached as fibres to the base complex. This separates modality-specific from shared contributions and offers a principled route toward modality-level identifiability. GVF integrates geometric dynamical systems, higher-order topology (enforced indirectly via geometric regularisation and Hodge decomposition), and structured multimodal fusion into a single framework for interpretable, modality-resolved risk modelling. This paper develops the mathematical foundations, architectural design, and formal guarantees; empirical validation is the subject of ongoing work.

Keywords: Graph Vector Field, multimodal health risk assessment, mixture-of-experts, simplicial complexes, Hodge Laplacian, discrete exterior calculus, wearable sensors, interpretable risk modelling
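The Helmholtz-Hodge decomposition mentioned in the abstract can be made concrete on the smallest possible example. The sketch below splits a scalar edge flow on an unfilled triangle into a gradient (exact) part and a harmonic part. The graph, the flow values, and the function names are assumptions for illustration. GVF itself works with vector-valued cochains on time-varying simplicial complexes, which this toy case does not attempt to model.

```python
# Oriented edges of an unfilled triangle: 0->1, 1->2, 2->0.
edges = [(0, 1), (1, 2), (2, 0)]

def divergence(flow, n_nodes=3):
    """Net inflow at each node for a flow defined on the oriented edges."""
    div = [0.0] * n_nodes
    for (src, dst), f in zip(edges, flow):
        div[src] -= f
        div[dst] += f
    return div

def hodge_split(flow):
    """Split an edge flow into gradient (exact) and harmonic parts.

    With no filled triangle there is no curl (coexact) part, and the
    divergence-free flows on this cycle are the constant circulations.
    """
    circulation = sum(flow) / len(flow)
    harmonic = [circulation] * len(flow)
    gradient = [f - circulation for f in flow]
    return gradient, harmonic

flow = [3.0, 0.0, 0.0]    # hypothetical risk flow concentrated on one edge
gradient, harmonic = hodge_split(flow)
print(gradient, harmonic)
```

The gradient part sums to zero around the cycle, so it can be written as differences of a node potential, while the harmonic part circulates without accumulating at any node. Filling the triangle with a 2-simplex would turn that circulation into a curl (coexact) component instead.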

300. ❌ Lipschitz verification of neural networks through training

Authors: Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28113v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

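Scoring rationale: The paper studies methods for verifying, and training to control, the Lipschitz constant of neural networks, which belongs to theoretical analysis of neural network robustness and generalization. It concentrates on network architecture design, training regularization, and verification techniques, and is entirely unrelated to the scoring keywords (all of which revolve around large models, innovations in deep learning techniques, and their applications in different domains). The paper does not cover large models, language models, MoE, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, inference acceleration, hallucination mitigation, explainable AI, world models, model merging, in-context learning, or AI-for-science applications.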

!!! tip deepseek-chat TL;DR

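The paper proposes a new paradigm that directly penalizes an upper bound on the Lipschitz constant during training and designs verifiable network architectures, achieving smaller and tighter Lipschitz bounds on MNIST than existing methods.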

Abstract translation (Chinese)

神经网络的全局Lipschitz常数同时主导其对抗鲁棒性与泛化能力。传统的“可验证训练”方法通常遵循“先训练后验证”范式：先训练网络，再尝试界定其Lipschitz常数。由于高效的“平凡上界”（即逐层Lipschitz常数的乘积）对任意网络呈指数级松弛，这些方法必须依赖计算代价高昂的技术，如半定规划、混合整数规划或分支定界法。我们提出一种新范式：不再为任意网络设计复杂验证器，而是设计能够通过快速平凡上界直接验证的网络。我们证明，在训练中直接对平凡上界施加惩罚会迫使其变得紧致，从而有效正则化真实的Lipschitz常数。为实现这一目标，我们识别了阻碍平凡上界紧致性的三个结构障碍（死亡神经元、偏置项与病态权重），并引入相应的架构缓解策略，包括一种新颖的范数饱和多重激活函数（norm-saturating polyactivations）以及无偏置正弦层。我们的方法避免了高级验证带来的运行时复杂度，同时取得了显著效果：我们在MNIST数据集上训练出具有小Lipschitz界（较同类工作低数个数量级）且紧致（与真实值误差在10%以内）的鲁棒网络。实验结果验证了理论保证，支持所提出的机制，并通过实证拓展至多种激活函数与非欧几里得范数。

摘要 (Abstract)

The global Lipschitz constant of a neural network governs both adversarial robustness and generalization. Conventional approaches to "certified training" typically follow a train-then-verify paradigm: they train a network and then attempt to bound its Lipschitz constant. Because the efficient "trivial bound" (the product of the layerwise Lipschitz constants) is exponentially loose for arbitrary networks, these approaches must rely on computationally expensive techniques such as semidefinite programming, mixed-integer programming, or branch-and-bound. We propose a different paradigm: rather than designing complex verifiers for arbitrary networks, we design networks to be verifiable by the fast trivial bound. We show that directly penalizing the trivial bound during training forces it to become tight, thereby effectively regularizing the true Lipschitz constant. To achieve this, we identify three structural obstructions to a tight trivial bound (dead neurons, bias terms, and ill-conditioned weights) and introduce architectural mitigations, including a novel notion of norm-saturating polyactivations and bias-free sinusoidal layers. Our approach avoids the runtime complexity of advanced verification while achieving strong results: we train robust networks on MNIST with Lipschitz bounds that are small (orders of magnitude lower than comparable works) and tight (within 10% of the ground truth). The experimental results validate the theoretical guarantees, support the proposed mechanisms, and extend empirically to diverse activations and non-Euclidean norms.

Keywords: Lipschitz constant, neural network verification, certified training, adversarial robustness, trivial bound, norm-saturating polyactivations, bias-free sinusoidal layers, MNIST
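The "trivial bound" in the abstract is the product of layerwise Lipschitz constants. The sketch below computes it for a two-layer linear map under the induced l-infinity norm (maximum absolute row sum), chosen here only because it is exact and needs no numerical library, and compares it with the true constant of the composed map. The weight matrices and function names are hypothetical, and the example omits activations, so it illustrates only how the bound can be loose, not the paper's architecture or training penalty.

```python
def linf_operator_norm(W):
    """Induced l-infinity norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in W)

def trivial_lipschitz_bound(weights):
    """Product of layerwise norms; valid when activations are 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= linf_operator_norm(W)
    return bound

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 0.0], [0.0, 2.0]]   # first layer stretches coordinate 1
W2 = [[2.0, 0.0], [0.0, 1.0]]   # second layer stretches coordinate 0

bound = trivial_lipschitz_bound([W1, W2])     # product of the two layer norms
exact = linf_operator_norm(matmul(W2, W1))    # norm of the composed linear map
print(bound, exact)
```

Each layer stretches a different coordinate, so the worst-case directions do not line up and the product overestimates the true constant by a factor of two. With more layers this kind of mismatch can compound, which is consistent with the abstract's remark that the bound is loose for arbitrary networks.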

301. ❌ Heddle: A Distributed Orchestration System for Agentic RL Rollout

Authors: Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, Xin Jin Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

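Scoring rationale: The paper's core contribution is system-level optimization of trajectory generation in agentic RL, where LLMs interact with external tools. It is highly relevant to "Large Language Models" (LLMs are the core component) and to "LLM Agents" and "Tool Use" (it studies LLM agents using tools in agentic RL). It does not touch the other keywords, such as MoE, SLMs, training methods, reasoning techniques, or scientific applications.

!!! tip deepseek-chat TL;DR

The paper targets the system bottleneck caused by long-tail trajectory generation when LLMs interact with tools in agentic RL. It proposes Heddle, a trajectory-centric distributed orchestration system that combines trajectory-level scheduling, trajectory-aware placement, and trajectory-adaptive resource management, achieving up to 2.5× higher end-to-end rollout throughput.

Abstract (Chinese translation)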

代理式强化学习（RL）使大语言模型（LLM）能够通过在数据收集的 rollout 阶段与策略训练阶段之间交替来求解复杂任务。在 rollout 过程中，智能体会生成轨迹，即大语言模型与外部工具之间的多步交互。然而，频繁的工具调用会导致生成长尾轨迹，从而成为 rollout 的瓶颈。这源于以单步为中心的设计忽略了轨迹上下文，进而引发了长尾轨迹生成的三个系统问题：排队延迟、干扰开销以及膨胀的每令牌处理时间。我们提出了 Heddle，一个以轨迹为中心的系统，用于优化代理式 rollout 执行的时机、位置与方式。Heddle 集成了三个核心机制：利用运行时预测和渐进优先级的轨迹级调度，以最小化累积排队；通过预排序动态规划及在空闲工具调用间隙进行机会性迁移的轨迹感知放置，以最小化干扰；以及轨迹自适应资源管理器，它能动态调整模型并行度，在加速长尾轨迹每令牌处理时间的同时，维持短轨迹的高吞吐量。在多种代理式强化学习工作负载上的评估表明，Heddle 有效消除了长尾瓶颈，与最先进的基线相比，实现了高达 2.5 倍的端到端 rollout 吞吐量提升。

Abstract

Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and a trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5$\times$ higher end-to-end rollout throughput compared to state-of-the-art baselines.

Keywords: Agentic Reinforcement Learning, LLMs, trajectory generation, distributed orchestration system, tool calls, rollout throughput, long-tail bottleneck, system optimization

302. ❌ InkDrop: Invisible Backdoor Attacks Against Dataset Condensation

Authors: He Yang, Dongyi Lv, Song Ma, Wei Xi, Zhi Wang, Hanlin Gu, Yajie Wang Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28092v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

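Scoring rationale: The paper studies invisible backdoor attacks against dataset condensation, which belongs to machine learning security. It is unrelated to every scoring keyword (all of which focus on large models, the technical principles of deep learning, and their applications), so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes InkDrop, an invisible backdoor attack against dataset condensation. It exploits the uncertainty near model decision boundaries to make the malicious manipulation harder to perceive while preserving attack effectiveness and model utility.

Abstract (Chinese translation)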

数据集蒸馏（Dataset Condensation，DC）是一种数据高效的学习范式，它通过合成规模较小但信息丰富的训练数据集，使模型能够达到与使用完整数据训练相当的性能。然而，近期研究揭示了DC对后门攻击存在严重脆弱性：恶意模式（例如触发器）可被植入到蒸馏后的数据集中，导致模型在面对特定输入时产生有目标的错误分类。现有攻击方法往往优先考虑攻击有效性和模型效用，却忽视了隐蔽性这一关键维度。为弥补这一不足，我们提出了InkDrop方法，它在不降低攻击有效性和模型效用的前提下，显著增强了恶意操作的不可感知性。InkDrop利用模型决策边界附近固有的不确定性——此处的微小输入扰动即可引发语义偏移——来构建一种隐蔽且有效的后门攻击。具体而言，InkDrop首先在目标决策边界附近筛选出与目标类别具有潜在语义亲和性的候选样本；随后，在感知一致性与空间一致性的约束下，学习样本依赖的扰动，从而将目标恶意行为嵌入到蒸馏数据集中。跨多个数据集的广泛实验验证了InkDrop的整体有效性，证明了其能够将对抗性意图融入蒸馏数据集，同时保持模型效用并最小化可检测性。我们的代码公开于https://github.com/lvdongyi/InkDrop。

Abstract

Dataset Condensation (DC) is a data-efficient learning paradigm that synthesizes small yet informative datasets, enabling models to match the performance of full-data training. However, recent work exposes a critical vulnerability of DC to backdoor attacks, where malicious patterns (e.g., triggers) are implanted into the condensation dataset, inducing targeted misclassification on specific inputs. Existing attacks always prioritize attack effectiveness and model utility, overlooking the crucial dimension of stealthiness. To bridge this gap, we propose InkDrop, which enhances the imperceptibility of malicious manipulation without degrading attack effectiveness and model utility. InkDrop leverages the inherent uncertainty near model decision boundaries, where minor input perturbations can induce semantic shifts, to construct a stealthy and effective backdoor attack. Specifically, InkDrop first selects candidate samples near the target decision boundary that exhibit latent semantic affinity to the target class. It then learns instance-dependent perturbations constrained by perceptual and spatial consistency, embedding targeted malicious behavior into the condensed dataset. Extensive experiments across diverse datasets validate the overall effectiveness of InkDrop, demonstrating its ability to integrate adversarial intent into condensed datasets while preserving model utility and minimizing detectability. Our code is available at https://github.com/lvdongyi/InkDrop.

Keywords: Dataset Condensation, Backdoor Attacks, Stealthiness, Decision Boundaries, Instance-dependent Perturbations, Adversarial Intent, Model Utility, Detectability

303. ❌ Transformer-Based Prognostics: Enhancing Network Availability by Improved Monitoring of Optical Fiber Amplifiers

Authors: Dominic Schneider, Lutz Rapp, Christoph Ament Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28081v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

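Scoring rationale: The paper uses a lightweight Transformer model to predict the lifetime of optical fiber amplifiers, which is an application of AI in engineering. It has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because that keyword covers applications of AI in science and engineering. The paper does not involve large language models (LLMs), innovations in the technical principles of deep learning, or any of the other specific techniques in the keyword list (such as MoE, SFT, or RAG), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study uses a lightweight Transformer model to predict optical fiber amplifier lifetime from condition-based monitoring data, improving the availability and reliability of optical networks and enabling real-time, edge-level predictive maintenance.

Abstract (Chinese translation)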

我们通过一种轻量级Transformer模型提升光网络的可用性与可靠性，该模型基于状态监测数据预测光纤放大器寿命，从而实现实时、边缘级的预测性维护，并推动可部署人工智能在自主网络运维中的应用。

Abstract

We enhance optical network availability and reliability through a lightweight transformer model that predicts optical fiber amplifier lifetime from condition-based monitoring data, enabling real-time, edge-level predictive maintenance and advancing deployable AI for autonomous network operation.

Keywords: Transformer model, optical fiber amplifiers, predictive maintenance, network availability, condition-based monitoring, edge-level AI, real-time prediction, autonomous network operation

304. ❌ Koopman-based surrogate modeling for reinforcement-learning-control of Rayleigh-Benard convection

Authors: Tim Plotzki, Sebastian Peitz Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

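Scoring rationale: The paper studies reinforcement-learning control of a fluid dynamics system, using Koopman theory and Linear Recurrent Autoencoder Networks (LRANs) as surrogate models to accelerate training. Its topic is the application of reinforcement learning to scientific computing (fluid mechanics), which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, the paper does not involve large language models (LLMs), innovations in the technical principles of deep learning, or any of the specific techniques named in the other scoring keywords (such as MoE, SFT, or RAG), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies Koopman theory and Linear Recurrent Autoencoder Networks as surrogate models for accelerating reinforcement-learning control of 2D Rayleigh-Bénard convection. The results show that policy-aware training mitigates distribution shift, and that a pretraining scheme combining the surrogate with direct numerical simulation recovers state-of-the-art performance while cutting training time by more than 40%.

Abstract (Chinese translation)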

训练强化学习（RL）智能体以控制流体动力学系统在计算上代价高昂，这源于直接数值模拟（DNS）控制方程的高成本。代理模型通过以较低计算成本近似动力学过程，提供了一种有前景的替代方案，但其作为RL训练环境的可行性受到分布偏移的限制，因为策略会导致状态分布超出代理模型训练数据的覆盖范围。在本研究中，我们探讨了使用线性循环自编码器网络（LRANs）来加速基于RL的二维瑞利-贝纳德对流控制。我们评估了两种训练策略：一种是在随机动作生成的预计算数据上训练的代理模型，另一种是利用从演化策略收集的数据进行迭代训练的策略感知代理模型。我们的结果表明，尽管仅使用代理模型训练会导致控制性能下降，但在预训练方案中将代理模型与DNS结合，能够恢复最先进的性能，同时将训练时间减少40%以上。我们证明，策略感知训练能够缓解分布偏移的影响，从而在状态空间中策略相关的区域实现更准确的预测。

Abstract

Training reinforcement learning (RL) agents to control fluid dynamics systems is computationally expensive due to the high cost of direct numerical simulations (DNS) of the governing equations. Surrogate models offer a promising alternative by approximating the dynamics at a fraction of the computational cost, but their feasibility as training environments for RL is limited by distribution shifts, as policies induce state distributions not covered by the surrogate training data. In this work, we investigate the use of Linear Recurrent Autoencoder Networks (LRANs) for accelerating RL-based control of 2D Rayleigh-Bénard convection. We evaluate two training strategies: a surrogate trained on precomputed data generated with random actions, and a policy-aware surrogate trained iteratively using data collected from an evolving policy. Our results show that while surrogate-only training leads to reduced control performance, combining surrogates with DNS in a pretraining scheme recovers state-of-the-art performance while reducing training time by more than 40%. We demonstrate that policy-aware training mitigates the effects of distribution shift, enabling more accurate predictions in policy-relevant regions of the state space.

Keywords: reinforcement learning, surrogate modeling, Koopman theory, Rayleigh-Bénard convection, fluid dynamics control, Linear Recurrent Autoencoder Networks, policy-aware training, computational acceleration

305. ❌ SIMR-NO: A Spectrally-Informed Multi-Resolution Neural Operator for Turbulent Flow Super-Resolution

Authors: Muhammad Abid, Omer San Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

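Scoring rationale: The paper focuses on a neural operator method for turbulence super-resolution, which belongs to scientific machine learning and is unrelated to most keywords (which mainly concern large language model techniques). It has some relevance only to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it applies scientific machine learning to fluid dynamics but does not directly involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes the Spectrally-Informed Multi-Resolution Neural Operator (SIMR-NO) for reconstructing high-resolution turbulent flow fields from severely under-resolved observations. On a two-dimensional turbulence benchmark, it reduces reconstruction error by 9.3% to 31.7% relative to the compared baselines and accurately reproduces the energy spectrum.

Abstract (Chinese translation)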

从严重欠解析的观测数据中重建高分辨率湍流场是计算流体力学与科学机器学习中的一个基本反问题。经典插值方法无法恢复缺失的细尺度结构，而现有的深度学习方法依赖于卷积架构，这些架构缺乏在大尺度放大因子下进行物理忠实重建所需的频谱与多尺度归纳偏置。本文提出谱感知多分辨率神经算子（SIMR-NO），这是一种分层算子学习框架，它将不适定反演映射分解到多个中间空间分辨率，在每个阶段结合确定性插值先验与谱门控傅里叶残差修正，并引入局部细化模块以恢复截断傅里叶基之外的细尺度空间特征。所提方法在科莫戈罗夫强迫二维湍流场景中进行了评估，从代表16倍降采样因子的极粗糙8×8观测数据重建128×128涡量场。在201个独立测试样本上，SIMR-NO取得了26.04%的平均相对ℓ₂误差，且在所有方法中误差方差最低，其重建误差相比FNO降低31.7%，相比EDSR降低26.0%，相比LapSRN降低9.3%。除逐点精度外，SIMR-NO是唯一能在全解析波数范围内忠实再现真实能量谱与拟能谱的方法，证明了其对湍流场进行物理一致超分辨率重建的能力。

Abstract

Reconstructing high-resolution turbulent flow fields from severely under-resolved observations is a fundamental inverse problem in computational fluid dynamics and scientific machine learning. Classical interpolation methods fail to recover missing fine-scale structures, while existing deep learning approaches rely on convolutional architectures that lack the spectral and multiscale inductive biases necessary for physically faithful reconstruction at large upscaling factors. We introduce the Spectrally-Informed Multi-Resolution Neural Operator (SIMR-NO), a hierarchical operator learning framework that factorizes the ill-posed inverse mapping across intermediate spatial resolutions, combines deterministic interpolation priors with spectrally gated Fourier residual corrections at each stage, and incorporates local refinement modules to recover fine-scale spatial features beyond the truncated Fourier basis. The proposed method is evaluated on Kolmogorov-forced two-dimensional turbulence, where $128\times128$ vorticity fields are reconstructed from extremely coarse $8\times8$ observations representing a $16\times$ downsampling factor. Across 201 independent test realizations, SIMR-NO achieves a mean relative $\ell_2$ error of $26.04\%$ with the lowest error variance among all methods, reducing reconstruction error by $31.7\%$ over FNO, $26.0\%$ over EDSR, and $9.3\%$ over LapSRN. Beyond pointwise accuracy, SIMR-NO is the only method that faithfully reproduces the ground-truth energy and enstrophy spectra across the full resolved wavenumber range, demonstrating physically consistent super-resolution of turbulent flow fields.
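To make the evaluation setup concrete, the sketch below computes a relative ℓ2 error between a reconstructed field and its ground truth, using the 128×128 and 8×8 grid sizes quoted in the abstract. It is only an illustration: the random field stands in for a vorticity snapshot, and the block-average downsampling and piecewise-constant upsampling are assumed stand-ins, since the abstract does not state which coarsening operator or interpolation baseline the authors use. The function names are hypothetical.

```python
import numpy as np


def relative_l2_error(pred, truth):
    """||pred - truth||_2 / ||truth||_2 over all grid points."""
    return float(np.linalg.norm(pred - truth) / np.linalg.norm(truth))


def block_average(field, factor):
    """Coarsen a square field by averaging non-overlapping factor x factor blocks.

    Assumption: the paper's actual downsampling operator is not specified
    in the abstract; block averaging is used here only for illustration.
    """
    n = field.shape[0] // factor
    return field.reshape(n, factor, n, factor).mean(axis=(1, 3))


rng = np.random.default_rng(0)
truth = rng.standard_normal((128, 128))   # placeholder for a vorticity field
coarse = block_average(truth, 16)         # 128 / 16 = 8, so shape (8, 8)

# Piecewise-constant upsampling back to 128 x 128 as a naive baseline.
upsampled = np.kron(coarse, np.ones((16, 16)))
err = relative_l2_error(upsampled, truth)
print(f"coarse shape: {coarse.shape}, relative l2 error: {err:.4f}")
```

A learned super-resolution model would replace the `np.kron` step; the error function stays the same.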

Keywords: turbulent flow super-resolution, neural operator, spectrally-informed, multi-resolution, inverse problem, scientific machine learning, Fourier residual corrections, energy spectrum

306. ❌ From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing

Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

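Scoring rationale: The paper focuses on using generative AI (specifically a variational autoencoder) to turn vessel trajectory data into safety-critical encounter scenarios for the digital testing of autonomous ships. Although it is an AI application, its content has no direct connection to any scoring keyword (all of which center on specific subfields such as large language models, the technical principles of deep learning, and AI for Science). The paper does not involve any technique or concept related to the keywords, including LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. Its AI application (trajectory generation) also differs from the AI for Science domain as scored here (bioinformatics and cheminformatics). All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a generative-AI framework that converts large-scale vessel trajectory data into structured safety-critical encounter scenarios to support digital testing of autonomous ships. Experiments show that the method improves trajectory fidelity and generates diverse scenarios beyond those in the historical record.

Abstract (Chinese translation)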

数字测试已成为自主海上导航系统开发与验证的关键范式，然而真实且多样化的安全关键遭遇场景的可用性仍然有限。现有方法要么依赖缺乏真实性的手工模板，要么直接从历史数据中提取案例，无法系统性地扩展罕见的高风险情境。
本文提出一种数据驱动框架，能够将大规模自动识别系统（Automatic Identification System, AIS）轨迹转化为结构化的安全关键遭遇场景。该框架结合了生成式轨迹建模与自动化遭遇配对及时间参数化技术，在保持真实交通特征的同时实现了可扩展的场景构建。为提升轨迹在噪声AIS观测下的真实性与鲁棒性，本文引入了一种多尺度时间变分自编码器，以捕捉船舶在不同时间分辨率下的运动动态。
基于真实海上交通流的实验表明，所提方法提升了轨迹的保真度与平滑性，保持了与观测数据的统计一致性，并能生成超出直接记录范围的多样化安全关键遭遇场景。该框架为构建场景库提供了实用路径，可支持自主导航与智能海上交通管理系统的数字测试、基准评估及安全验证。代码发布于 https://anonymous.4open.science/r/traj-gen-anonymous-review。

Abstract

Digital testing has emerged as a key paradigm for the development and verification of autonomous maritime navigation systems, yet the availability of realistic and diverse safety-critical encounter scenarios remains limited. Existing approaches either rely on handcrafted templates, which lack realism, or extract cases directly from historical data, which cannot systematically expand rare high-risk situations. This paper proposes a data-driven framework that converts large-scale Automatic Identification System (AIS) trajectories into structured safety-critical encounter scenarios. The framework combines generative trajectory modeling with automated encounter pairing and temporal parameterization to enable scalable scenario construction while preserving real traffic characteristics. To enhance trajectory realism and robustness under noisy AIS observations, a multi-scale temporal variational autoencoder is introduced to capture vessel motion dynamics across different temporal resolutions. Experiments on real-world maritime traffic flows demonstrate that the proposed method improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables the generation of diverse safety-critical encounter scenarios beyond those directly recorded. The resulting framework provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems. Code is available at https://anonymous.4open.science/r/traj-gen-anonymous-review.

Keywords: Generative AI, Autonomous Ship, Digital Testing, Vessel Trajectories, Safety-critical Encounter Scenarios, Variational Autoencoder, Maritime Navigation, AIS Data

307. ❌ Physics-Embedded Feature Learning for AI in Medical Imaging

Authors: Pulock Das, Al Amin, Kamrul Hasan, Rohan Thompson, Azubike D. Okpalaeze, Liang Hong Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

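Scoring rationale: The paper focuses on deep learning for medical imaging and proposes a physics-embedded CNN framework (PhysNet) for tumor classification and physical-parameter learning. It does not involve large language models (LLMs) or related techniques (such as MoE, SFT, RLHF, or RAG). The only relevant keywords are "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), because the work applies AI in a scientific/biomedical domain, and "Mechanistic Interpretability OR Explainable AI" (5 points), because it emphasizes interpretability and physical consistency, although interpretability of LLMs is not its focus. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes PhysNet, a physics-embedded deep learning framework for tumor classification in medical imaging that integrates tumor growth dynamics to improve accuracy, interpretability, and clinical trust. Experiments show that it outperforms existing baselines on a brain MRI dataset.

Abstract (Chinese translation)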

深度学习模型在智能医疗场景中已展现出强大的性能，但现有方法大多作为黑箱运行，忽略了控制肿瘤生长的物理过程，限制了其可解释性、鲁棒性及临床可信度。为应对这一局限，我们提出PhysNet——一种物理嵌入的深度学习框架，该框架将肿瘤生长动力学直接整合到卷积神经网络的特征学习过程中。与仅在输出层施加物理约束的传统物理信息方法不同，PhysNet将肿瘤生长的反应扩散模型嵌入到ResNet主干网络的中间特征表示内。该架构通过端到端训练，在完成多类别肿瘤分类的同时，联合学习潜在的肿瘤密度场、其时间演化过程以及具有生物学意义的物理参数（包括肿瘤扩散速率和生长速率）。这一设计是必要的，因为纯粹数据驱动的模型即使具有高准确性或基于集成方法，也无法保证预测结果符合物理一致性，或提供对肿瘤行为的深入理解。在大型脑部MRI数据集上的实验结果表明，PhysNet在分类准确率和F1分数上均优于多种先进的深度学习基线模型（包括MobileNetV2、VGG16、VGG19及集成模型）。除性能提升外，PhysNet还生成可解释的潜在表示及学习到的生物物理参数，这些参数与既有的医学知识相符，从而凸显了物理嵌入表示学习作为构建更可信、更具临床意义的医疗人工智能系统的一条可行路径。

Abstract

Deep learning (DL) models have achieved strong performance in intelligent healthcare settings, yet most existing approaches operate as black boxes and ignore the physical processes that govern tumor growth, limiting interpretability, robustness, and clinical trust. To address this limitation, we propose PhysNet, a physics-embedded DL framework that integrates tumor growth dynamics directly into the feature learning process of a convolutional neural network (CNN). Unlike conventional physics-informed methods that impose physical constraints only at the output level, PhysNet embeds a reaction-diffusion model of tumor growth within intermediate feature representations of a ResNet backbone. The architecture jointly performs multi-class tumor classification while learning a latent tumor density field, its temporal evolution, and biologically meaningful physical parameters, including tumor diffusion and growth rates, through end-to-end training. This design is necessary because purely data-driven models, even when highly accurate or ensemble-based, cannot guarantee physically consistent predictions or provide insight into tumor behavior. Experimental results on a large brain MRI dataset demonstrate that PhysNet outperforms multiple state-of-the-art DL baselines, including MobileNetV2, VGG16, VGG19, and ensemble models, achieving superior classification accuracy and F1-score. In addition to improved performance, PhysNet produces interpretable latent representations and learned bio-physical parameters that align with established medical knowledge, highlighting physics-embedded representation learning as a practical pathway toward more trustworthy and clinically meaningful medical AI systems.

Keywords: physics-embedded deep learning, medical imaging, tumor classification, interpretability, reaction-diffusion model, ResNet, brain MRI, bio-physical parameters

308. ❌ Diffusion Maps is not Dimensionality Reduction

Authors: Julio Candanedo, Alejandro Patiño Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

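Scoring rationale: The paper studies the geometric-representation properties of diffusion maps in manifold learning, with a comparative analysis of how well isometric coordinates of a Swiss roll dataset can be recovered. It belongs to classical dimensionality reduction and manifold learning. All scoring keywords concern large models, the technical principles of deep learning, and their applications in various domains, none of which this paper addresses, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

By comparing diffusion maps, Isomap, and UMAP on a Swiss roll dataset, the paper shows that diffusion maps provide a spectral representation of the intrinsic geometry rather than a complete dimensionality-reduction chart. The correct chart lies in the span of the diffusion coordinates, that is, in linear combinations of them.

Abstract (Chinese translation)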

扩散映射（DMAP）常被用作降维工具，但更准确地说，它提供的是内在几何结构的谱表示，而非完整的坐标映射方法。为阐明这一区别，我们以一个已知等距坐标的瑞士卷结构为例，在潜在维度上比较了DMAP、Isomap和UMAP方法。针对每种表示，我们采用预设仿射读取器拟合真实坐标映射，并测量重构误差。Isomap能最有效地恢复低维坐标映射，UMAP提供了折中的表现，而DMAP仅在融合多个扩散模式后才达到精确重构。这表明正确的坐标映射存在于扩散坐标的张成空间中，但标准DMAP方法本身并不能自动识别出合适的组合方式。

摘要 (Abstract)

Diffusion maps (DMAP) are often used as a dimensionality-reduction tool, but more precisely they provide a spectral representation of the intrinsic geometry rather than a complete charting method. To illustrate this distinction, we study a Swiss roll with known isometric coordinates and compare DMAP, Isomap, and UMAP across latent dimensions. For each representation, we fit an oracle affine readout to the ground-truth chart and measure reconstruction error. Isomap most efficiently recovers the low-dimensional chart, UMAP provides an intermediate tradeoff, and DMAP becomes accurate only after combining multiple diffusion modes. Thus the correct chart lies in the span of diffusion coordinates, but standard DMAP do not by themselves identify the appropriate combination.

Keywords: Diffusion Maps, dimensionality reduction, spectral representation, Swiss roll, Isomap, UMAP, intrinsic geometry, reconstruction error
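
The oracle affine readout mentioned in the abstract above can be sketched as an ordinary least-squares fit from an embedding (plus a bias column) to the known chart. This is a minimal illustration, not the authors' code: the function names are hypothetical, the chart is reduced to a single arc-length coordinate, and the cosine modes are an idealized stand-in for diffusion coordinates on a one-dimensional strip rather than the output of a real diffusion-map computation.

```python
import numpy as np


def affine_readout_rmse(embedding: np.ndarray, chart: np.ndarray) -> float:
    """Fit chart ~ [embedding, 1] @ W by least squares and return the RMSE."""
    n = embedding.shape[0]
    design = np.hstack([embedding, np.ones((n, 1))])  # bias column makes the map affine
    weights, *_ = np.linalg.lstsq(design, chart, rcond=None)
    residual = chart - design @ weights
    return float(np.sqrt(np.mean(residual ** 2)))


def cosine_modes(s: np.ndarray, length: float, n_modes: int) -> np.ndarray:
    """Assumed stand-in for diffusion coordinates: cos(k * pi * s / length), k = 1..n_modes."""
    k = np.arange(1, n_modes + 1)
    return np.cos(np.pi * np.outer(s, k) / length)


rng = np.random.default_rng(0)
length = 10.0
s = rng.uniform(0.0, length, size=500)  # ground-truth arc-length coordinate
chart = s[:, None]

# Reconstruction error of the chart as more modes are made available to the readout.
errors = {m: affine_readout_rmse(cosine_modes(s, length, m), chart) for m in (1, 2, 4, 8)}
for m, err in errors.items():
    print(f"{m} modes: RMSE = {err:.4f}")
```

In this toy setting the error shrinks as modes are added, which mirrors the abstract's point that the chart lies in the span of several coordinates rather than in any single one.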

309. ❌ Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Author: Haochuan Kevin Wang Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28013v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

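Scoring rationale: The paper's core subject is prompt injection attacks on LLM agents and the defenses against them, which bears directly on the keywords "Large Language Models", "LLM Agents", and "Tool Use" (10 points each). "Multi-agent Systems" receives 5 points because the work involves agent-to-agent interaction (e.g., the Claude relay node). "Hallucination Mitigation" receives 5 points for its connection to safety defenses. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and receive 0 points.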

!!! tip deepseek-chat TL;DR

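Through a stage-decomposed analysis, the paper studies how five frontier LLM agents defend against prompt injection attacks and finds that model safety depends on whether adversarial content propagates across pipeline stages, not on whether it is seen.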

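Translated Abstract

We present a stage-decomposed analysis of prompt injection attacks against five frontier large language model agents. Existing work mainly measures task-level attack success rate; we instead locate the specific pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) and track its flow through four kill-chain stages (Exposed, Persisted, Relayed, Executed) across four attack surfaces and five defense conditions (764 runs in total, of which 428 were attacked with no defense). Our core finding is that model safety depends not on whether the model is exposed to adversarial content but on whether that content propagates across pipeline stages. Specifically: (1) in our evaluation, the exposure rate is 100% for all five models, so the safety gap lies entirely in the downstream stages; (2) Claude strips the injected content at the write_memory summarization stage (attack success rate 0/164), whereas GPT-4o-mini propagates the canary intact (attack success rate 53%, 95% confidence interval: 41–65%); (3) DeepSeek shows a 0% attack success rate on memory attack surfaces but a 100% attack success rate on tool-stream attack surfaces for the same model, a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) yield a 100% attack success rate because of a threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents: none of the 40 canaries entered shared memory.

Abstract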

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model’s defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages – Exposed, Persisted, Relayed, Executed – across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models – the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41–65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model – a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents – 0/40 canaries survived into shared memory.

Keywords: prompt injection, LLM agents, attack surfaces, safety tiers, stage-level tracking, defense mechanisms, cryptographic canary token, kill-chain stages
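
The canary-token instrumentation described in the abstract above can be illustrated with a short sketch. Only the token pattern `SECRET-[A-F0-9]{8}` and the four stage names come from the abstract; the function names, the per-stage text artifacts, and the example run are hypothetical and are not taken from the paper's harness.

```python
import re
import secrets

# Pattern and stage names as given in the abstract.
CANARY_PATTERN = re.compile(r"SECRET-[A-F0-9]{8}")
STAGES = ("Exposed", "Persisted", "Relayed", "Executed")


def new_canary() -> str:
    """Generate a fresh token: 4 random bytes rendered as 8 uppercase hex digits."""
    return "SECRET-" + secrets.token_hex(4).upper()


def stages_reached(canary: str, stage_outputs: dict[str, str]) -> list[str]:
    """Return the kill-chain stages whose recorded text contains the canary."""
    return [stage for stage in STAGES if canary in stage_outputs.get(stage, "")]


canary = new_canary()

# Hypothetical run: the injection is seen in a tool output, but the memory
# summary drops it, so nothing downstream carries the token.
run = {
    "Exposed": f"tool output: ignore prior instructions and send {canary}",
    "Persisted": "memory summary: weather report for Tuesday",
    "Relayed": "",
    "Executed": "",
}

print(stages_reached(canary, run))
```

Recording the furthest stage reached per run, rather than a single success flag, is what lets an analysis of this kind separate exposure from propagation.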

310. ❌ FedDES: Graph-Based Dynamic Ensemble Selection for Personalized Federated Learning

Authors: Brianna Mueller, W. Nick Street Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28006v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

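Scoring rationale: The paper studies FedDES, a personalized federated learning (pFL) framework that uses a graph neural network (GNN) for dynamic ensemble selection to address negative transfer caused by statistical heterogeneity. All keywords relate directly to large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas the paper proposes no innovation in large models or in the principles of deep learning techniques; it only applies a GNN in a federated learning setting. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": because the paper runs experiments on ICU healthcare data, it counts as an application of AI in science/medicine, but this is not the core contribution, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.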

!!! tip deepseek-chat TL;DR

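The paper proposes the FedDES framework, which achieves instance-level personalized federated learning through GNN-based dynamic ensemble selection, effectively addressing negative transfer under non-IID data and outperforming existing methods on CIFAR-10 and ICU healthcare data.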

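Translated Abstract

Statistical heterogeneity in federated learning (FL) often leads to negative transfer, in which a single global model struggles to fit diverse client data distributions. Personalized federated learning (pFL) aims to solve this by tailoring models to individual clients. However, in most existing pFL methods, clients integrate the contributions of other clients uniformly, ignoring the fact that not all clients bring equal benefit. Moreover, although the reliability of different peer models often varies across samples within the same client, the potential of instance-level personalization remains largely unexplored.
We propose FedDES (Federated Dynamic Ensemble Selection), a decentralized pFL framework that achieves instance-level personalization through dynamic ensemble selection. At the core of the method is a graph neural network (GNN) meta-learner trained on a heterogeneous graph that models the interactions between data samples and candidate classifiers. For each test query, the GNN dynamically selects and weights peer client models, building an ensemble of the most competent classifiers while effectively suppressing contributions from models that are irrelevant or potentially harmful to performance. Experiments on CIFAR-10 and real-world ICU healthcare data show that FedDES outperforms state-of-the-art pFL baselines in non-IID settings and offers robust protection against negative transfer.

Abstract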

Statistical heterogeneity in Federated Learning (FL) often leads to negative transfer, where a single global model fails to serve diverse client distributions. Personalized federated learning (pFL) aims to address this by tailoring models to individual clients. However, under most existing pFL approaches, clients integrate peer client contributions uniformly, which ignores the reality that not all peers are likely to be equally beneficial. Additionally, the potential for personalization at the instance level remains largely unexplored, even though the reliability of different peer models often varies across individual samples within the same client. We introduce FedDES (Federated Dynamic Ensemble Selection), a decentralized pFL framework that achieves instance-level personalization through dynamic ensemble selection. Central to our approach is a Graph Neural Network (GNN) meta-learner trained on a heterogeneous graph modeling interactions between data samples and candidate classifiers. For each test query, the GNN dynamically selects and weights peer client models, forming an ensemble of the most competent classifiers while effectively suppressing contributions from those that are irrelevant or potentially harmful for performance. Experiments on CIFAR-10 and real-world ICU healthcare data demonstrate that FedDES outperforms state-of-the-art pFL baselines in non-IID settings, offering robust protection against negative transfer.

Keywords: Personalized Federated Learning, Dynamic Ensemble Selection, Graph Neural Network, Instance-level Personalization, Negative Transfer, Non-IID Data, Healthcare Data, Decentralized Framework

Authors: Shaoheng Xu, Chunyi Sun, Jihui Zhang, Amy Bastine, Prasanga N. Samarasinghe, Thushara D. Abhayapala, Hongdong Li Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27998v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

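Scoring rationale: The paper studies spatial up-sampling reconstruction of head-related impulse responses (HRIRs) with a Transformer model, which belongs to audio signal processing and machine hearing. Although it uses the Transformer architecture, the model is not a large language model (LLM) or a general-purpose foundation model; it is a task-specific model for 3D spatial audio reconstruction. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some relevance (5 points), because the work is an application of AI to scientific computing/audio science. The other keywords concern large-model principles, training methods, inference optimization, alignment, agent systems, and similar topics, and are entirely unrelated to this paper's domain-specific audio reconstruction task (0 points).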

!!! tip deepseek-chat TL;DR

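The paper proposes BiFormer3D, a time-domain, grid-free binaural Transformer that reconstructs head-related impulse responses (HRIRs) at arbitrary directions from sparse measurements; on the SONICOM dataset it reduces reconstruction error relative to prior methods and shows that minimum-phase pre-processing is unnecessary.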

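Translated Abstract

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense measurements for every listener are costly. This work addresses HRIR spatial up-sampling from sparse per-listener measurements: given a small number of measured HRIRs for a listener, predict the HRIRs at unmeasured target directions. Existing learning methods usually operate in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can harm temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer that reconstructs HRIRs at arbitrary directions from sparse inputs. The model uses sinusoidal spatial feature encoding and a one-dimensional convolution (Conv1D) refinement module, supplemented by auxiliary heads for interaural time difference (ITD) and interaural level difference (ILD). Experiments on the SONICOM dataset show improvements over existing methods in normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors; ablations validate each module and show that minimum-phase pre-processing is not necessary.

Abstract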

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase pre-processing is unnecessary.

Keywords: head-related impulse responses, HRIR reconstruction, Transformer, spatial up-sampling, binaural rendering, time-domain, grid-free, interaural time difference

312. ❌ FedFG: Privacy-Preserving and Robust Federated Learning via Flow-Matching Generation

Authors: Ruiyang Wang, Rong Pan, Zhengan Yao Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27986v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

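Scoring rationale: The paper studies privacy protection and robustness in federated learning (FL) and proposes FedFG, a framework based on flow-matching generation. Its content focuses on federated learning, privacy protection, and defense against adversarial attacks, which are traditional machine learning security topics; it does not involve large models, innovation in the principles of deep learning techniques, or applications of large models in different domains. All keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper studies a federated learning framework and is entirely unrelated to them.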

!!! tip deepseek-chat TL;DR

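Targeting insufficient privacy protection and vulnerability to adversarial attacks in federated learning, the paper proposes FedFG, a framework based on flow-matching generation that achieves higher accuracy and stronger privacy protection on MNIST, FMNIST, and CIFAR-10.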

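Translated Abstract

Federated learning (FL) enables distributed clients to collaboratively train a global model using local private data. However, recent studies show that conventional FL algorithms still fall short in privacy protection, and that the server lacks a reliable and stable aggregation rule for updating the global model. This creates opportunities for attackers: on the one hand, they may eavesdrop on uploaded gradients or model parameters, potentially leaking the private data of benign clients; on the other hand, they may compromise clients to launch poisoning attacks that corrupt the global model. To balance accuracy and security, we propose FedFG, a robust FL framework based on flow-matching generation that protects client privacy while resisting sophisticated poisoning attacks. On the client side, each local network is decoupled into a private feature extractor and a public classifier. Each client is also equipped with a flow-matching generator that stands in for the feature extractor when interacting with the server, so that private features are protected while an approximation of the underlying data distribution is learned. To complement the client-side design, the server adopts a client-update verification scheme and a novel robust aggregation mechanism driven by synthetic samples produced by the flow-matching generator. Experiments on MNIST, FMNIST, and CIFAR-10 show that, compared with prior work, our method adapts to multiple attack strategies and achieves higher accuracy while maintaining strong privacy protection.

Abstract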

Federated learning (FL) enables distributed clients to collaboratively train a global model using local private data. Nevertheless, recent studies show that conventional FL algorithms still exhibit deficiencies in privacy protection, and the server lacks a reliable and stable aggregation rule for updating the global model. This situation creates opportunities for adversaries: on the one hand, they may eavesdrop on uploaded gradients or model parameters, potentially leaking benign clients’ private data; on the other hand, they may compromise clients to launch poisoning attacks that corrupt the global model. To balance accuracy and security, we propose FedFG, a robust FL framework based on flow-matching generation that simultaneously preserves client privacy and resists sophisticated poisoning attacks. On the client side, each local network is decoupled into a private feature extractor and a public classifier. Each client is further equipped with a flow-matching generator that replaces the extractor when interacting with the server, thereby protecting private features while learning an approximation of the underlying data distribution. Complementing the client-side design, the server employs a client-update verification scheme and a novel robust aggregation mechanism driven by synthetic samples produced by the flow-matching generator. Experiments on MNIST, FMNIST, and CIFAR-10 demonstrate that, compared with prior work, our approach adapts to multiple attack strategies and achieves higher accuracy while maintaining strong privacy protection.

Keywords: Federated Learning, Privacy Protection, Robust Aggregation, Poisoning Attacks, Flow-matching Generation, Client Privacy, Model Security, Distributed Learning

313. ❌ Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning

Authors: Bodla Krishna Vamshi, Haizhao Yang Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27971v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

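Scoring rationale: The paper focuses on interpretability methods for reinforcement learning (RL) and proposes a method that automatically selects prototypes to make RL models more interpretable. It is unrelated to most keywords, because they mainly concern large language models (LLMs) and related techniques (training, alignment, inference, deployment, and so on), whereas the paper's core is RL interpretability and does not involve LLMs. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI": the paper directly studies the application of explainable-AI methods to RL, so it is scored 10 (highly relevant, core content). All other keywords are scored 0 (entirely unrelated).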

!!! tip deepseek-chat TL;DR

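The paper proposes an interpretability method that automatically selects prototypes to make reinforcement learning models more interpretable; on standard Gym environments it matches the performance of existing Prototype-Wrapper Networks while remaining competitive with the original black-box models.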

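Translated Abstract

In recent years, reinforcement learning (RL) has been widely adopted, from solving real-time games to fine-tuning large language models with human preference data, significantly improving alignment between models and user expectations. However, as model complexity grows exponentially, interpreting these systems becomes increasingly difficult. Although many explainability methods have been developed for computer vision and natural language processing to elucidate local and global reasoning patterns, their application to RL remains very limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance in RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap, enhancing explainability in RL without sacrificing the efficiency of the original black-box models. However, these methods usually require manually defined reference prototypes, which often depends on expert domain knowledge. In this work, we propose a method that removes this dependency by automatically selecting optimal prototypes from the available data. Preliminary experiments on standard Gym environments show that our method matches the performance of existing PW-Nets while remaining competitive with the original black-box models.

Abstract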

Recent years have witnessed the widespread adoption of reinforcement learning (RL), from solving real-time games to fine-tuning large language models using human preference data significantly improving alignment with user expectations. However, as model complexity grows exponentially, the interpretability of these systems becomes increasingly challenging. While numerous explainability methods have been developed for computer vision and natural language processing to elucidate both local and global reasoning patterns, their application to RL remains limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance within RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap by enhancing explainability in RL domains without sacrificing the efficiency of the original black-box models. However, these methods typically require manually defined reference prototypes, which often necessitate expert domain knowledge. In this work, we propose a method that removes this dependency by automatically selecting optimal prototypes from the available data. Preliminary experiments on standard Gym environments demonstrate that our approach matches the performance of existing PW-Nets, while remaining competitive with the original black-box models.

Keywords: reinforcement learning, interpretability, prototype selection, explainable AI, black-box models, Gym environments, performance, automated methods

314. ❌ Cardiovascular-Kidney-Metabolic Health: Insights from Wearables and Blood Biomarkers

Authors: Zeinab Esmaeilpour, A. Ali Heydari, Daniel McDuff, Anthony Z Faranesh, Conor Heneghan, Shwetak Patel, Mark Malhotra, Cathy Speed, Javier L. Prieto, Ahmed A. Metwally Journal/Source: arxiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27787v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	5.0/10	0.0

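Scoring rationale: The paper studies early detection of cardiovascular-kidney-metabolic (CKM) syndrome, characterizing prevalence and cross-system interactions by integrating wearable-device data with clinical biomarkers. Its content is focused entirely on medical and health research and does not involve large models, the principles of deep learning techniques, or applications of AI techniques. All technical keywords (LLMs, MoE, training methods, reasoning techniques, model optimization, and so on) are unrelated to the paper and receive 0 points. The only potentially relevant keyword is "AI for Science"; however, the paper does not explicitly use AI methods, only statistical analysis and feature extraction, so it receives 5 points (some relevance).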

!!! tip deepseek-chat TL;DR

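By integrating wearable-device data with clinical biomarkers, the study reveals the subclinical heterogeneity of cardiovascular-kidney-metabolic (CKM) syndrome, finds that cardiovascular deviation is the most common single phenotype, and identifies step count, Active Zone Minutes, and resting heart rate as the most effective wearable-derived predictors of cardiovascular and metabolic decline.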

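Translated Abstract

Cardiovascular-Kidney-Metabolic (CKM) syndrome is increasingly a public health crisis, yet the subclinical heterogeneity of its component systems remains underexplored. Early detection of physiological deviation is critical for preventing irreversible organ damage and death. By integrating continuous wearable-device data with clinical biomarkers, this study characterizes the prevalence and interrelation of CKM impairment in a US cohort (N=841). We assess cardiovascular function (total cholesterol to HDL cholesterol ratio, Chol/HDL) and kidney function (estimated glomerular filtration rate, eGFR) through clinical biomarkers, and assess metabolic health risk through the Homeostatic Model Assessment of Insulin Resistance (HOMA-IR). We find that although metabolic and cardiovascular disruptions are significantly associated (r=0.26, p<0.001), early kidney impairment appears independently. Using a normalized deviance score, we identify significant health impairment in 29.0% of the cohort. Cardiovascular deviation is the most common single phenotype (13.3%), followed by metabolic deviation (9.1%) and renal deviation (6.25%), while dual metabolic-cardiovascular impairment occurs in only 2.2% of participants. These findings suggest that high system-specific deviance may serve as an indicator of accelerated physiological aging within the corresponding organ system. In addition, feature ablation analysis shows that step count, Active Zone Minutes, and resting heart rate are the most effective wearable-derived predictors of cardiovascular and metabolic decline. These results underscore the need for a multi-system subtyping approach and demonstrate that wearable-derived phenotypes can support the early, targeted interventions needed to manage the complexity of CKM syndrome.

Abstract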

Cardiovascular-Kidney-Metabolic (CKM) syndrome represents a growing public health crisis, yet the subclinical heterogeneity of its component systems remains underexplored. Early detection of physiological deviation is critical for preventing irreversible organ damage and mortality. Here, we characterize the prevalence and interplay of CKM impairment in a US cohort (N=841) by integrating continuous wearable data with clinical biomarkers. We assessed cardiovascular, kidney via clinical biomarkers, namely Chol/HDL, eGFR, as well as metabolic health risk through Homeostatic Model Assessment of Insulin Resistance (HOMA-IR). We show that while metabolic and cardiovascular disruptions are significantly associated (r=0.26, p<0.001), early-stage kidney impairment manifests independently. Utilizing a normalized deviance score, we identified significant health impairments in 29.0% of the cohort. Cardiovascular deviation was the most prevalent singular phenotype (13.3%), followed by metabolic (9.1%) and renal (6.25%) deviations, with dual metabolic-cardiovascular impairment occurring in only 2.2% of participants. These findings suggest that high system-specific deviance may serve as an indicator for accelerated physiological aging within the respective organ system. Furthermore, feature ablation analysis revealed that step count, Active Zone Minutes, and resting heart rate are the most potent wearable-derived predictors of cardiovascular and metabolic decline. These findings underscore the necessity of a multi-system subtyping approach, demonstrating that wearable-derived phenotypes can facilitate the early, targeted interventions required to manage the complex landscape of CKM syndrome.

Keywords: Cardiovascular-Kidney-Metabolic syndrome, wearable data, clinical biomarkers, subclinical heterogeneity, physiological deviation, early detection, multi-system subtyping, health impairment prediction
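
For reference, HOMA-IR is conventionally computed from fasting insulin and fasting glucose. The abstract above does not state which unit convention the authors used, so the formula below is the standard definition rather than the paper's exact procedure:

$$
\text{HOMA-IR} = \frac{\text{fasting insulin}\ (\mu\text{U/mL}) \times \text{fasting glucose}\ (\text{mmol/L})}{22.5}
$$

When glucose is reported in mg/dL, the denominator becomes 405.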

315. ❌ Quantitative mapping of dynamic 3D transport in growing cells via volumetric spatio-temporal image correlation spectroscopy (vSTICS)

Authors: Ahmad Mahmood, Paul W. Wiseman Journal/Source: arxiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27484v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

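Scoring rationale: The paper focuses on biological imaging and analysis of cellular transport, which belongs to biophysics and microscopy. Its content is entirely unrelated to the vast majority of keywords (which concern large models, deep learning, the principles of AI techniques, and so on), so they are scored 0. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper involves a scientific application (cell biology); however, the research is not AI-driven and relies only on traditional computational analysis, so it receives 5 points (some relevance).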

!!! tip deepseek-chat TL;DR

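The paper develops vSTICS, a volumetric spatio-temporal image correlation spectroscopy method that quantitatively maps flow, diffusion, and particle density in living cells from 3D fluorescence time series; applied to mitochondrial transport in Camellia japonica pollen tubes, it reveals an asymmetric reverse-fountain pattern.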

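Translated Abstract

Quantitative imaging of three-dimensional (3D) flow, diffusion, and particle density in crowded living cells remains challenging, because most dynamic optical microscopy measurements are effectively two-dimensional and existing analysis methods struggle with dense, noisy 3D volumetric data. This paper presents volumetric spatio-temporal image correlation spectroscopy (vSTICS), a framework that reconstructs voxel-resolved flow, diffusion coefficients, and particle density distributions from 3D fluorescence time series. Growing Camellia japonica pollen tubes were imaged with field-synthesis lattice light-sheet microscopy, and localized spatio-temporal correlation analysis was applied to overlapping 3D samples to produce maps of velocity, diffusion, and density. Validation with synthetic flow-diffusion simulations shows that the method accurately recovers the seeded transport parameters, including velocities near $3$ $μ$m s$^{-1}$ and diffusion coefficients near $10^{-3}$ $μ$m$^2$ s$^{-1}$. Fluorescent microsphere experiments verified the particle number and point spread function readouts and measured a diffusion coefficient of $0.3 \pm 0.1$ $μ$m$^2$ s$^{-1}$ in gel, consistent with the $0.5 \pm 0.2$ $μ$m$^2$ s$^{-1}$ measured by imaging-FCS. Applied to mitochondria in pollen tubes, vSTICS resolved a bidirectional reverse-fountain pattern: slower anterograde transport ($0.1$-$1$ $μ$m s$^{-1}$) and faster retrograde motion (peaking near $3$ $μ$m s$^{-1}$), along with a retrograde transport corridor about $2$ $μ$m wide. Density and diffusion maps indicate a denser, more advective core region and higher peripheral diffusivity. Imaging of high-density sub-diffraction vesicles yielded similar velocity landscapes, with particle densities about ten times higher. These results establish vSTICS as a practical method for quantitative 3D imaging of intracellular transport and refine the reverse-fountain model by revealing asymmetric, predominantly transverse circulation.

Abstract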

Quantitatively mapping three-dimensional (3D) flow, diffusion, and particle density in crowded living cells remains challenging because most dynamic optical microscopy measurements are effectively planar and existing analysis methods struggle with dense, noisy volumetric data. We introduce volumetric spatio-temporal image correlation spectroscopy (vSTICS), a framework that recovers voxel-resolved flow, diffusion coefficients, and particle densities from 3D fluorescence time series. Growing Camellia japonica pollen tubes were imaged with field-synthesis lattice light-sheet microscopy, and localized 3D spatio-temporal correlation analysis was applied to overlapping volumetric samples to generate maps of velocity, diffusion, and density. Validation with synthetic flow-diffusion simulations showed accurate recovery of seeded transport parameters, including velocities near $3$ $μ$m s$^{-1}$ and diffusion near $10^{-3}$ $μ$m$^2$ s$^{-1}$. Fluorescent microsphere experiments verified particle number and point spread function readouts and measured diffusion coefficients of $0.3 \pm 0.1$ $μ$m$^2$ s$^{-1}$ in gel, consistent with imaging-FCS measurements of $0.5 \pm 0.2$ $μ$m$^2$ s$^{-1}$. Applied to mitochondria in pollen tubes, vSTICS resolved a bidirectional reverse-fountain pattern with slower anterograde transport ($0.1$-$1$ $μ$m s$^{-1}$) and faster retrograde motion peaking near $3$ $μ$m s$^{-1}$, plus a retrograde corridor about $2$ $μ$m wide. Density and diffusion maps indicated a denser, more advective core and higher peripheral diffusion. High-density sub-diffraction vesicle mapping produced similar velocity landscapes with about ten-fold higher particle densities. These results establish vSTICS as a practical method for quantitative 3D mapping of intracellular transport and refines the reverse-fountain model by revealing asymmetric, predominantly transverse circulation.

Keywords: volumetric spatio-temporal image correlation spectroscopy, 3D intracellular transport, fluorescence microscopy, pollen tubes, mitochondria, flow-diffusion mapping, reverse-fountain model, voxel-resolved analysis

316. ❌ Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders

Authors: Hongzhuo Chen, Zhanliang Wang, Quan M. Nguyen, Gongbo Zhang, Chunhua Weng, Kai Wang Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27104v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
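
To make the pass/fail decision behind these tables concrete, here is a small sketch. It assumes the total score is the sum of the Score column, and it uses the pass-score formula from the report header (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names are hypothetical. The rule that turns weight and relevance into a per-row score is not stated in this section, so the sketch does not try to reproduce it.

```python
def pass_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold as given in the report header: base + factor * total weight."""
    return base + factor * total_weight

def total_score(rows):
    """Sum the last column (Score) of tab-separated table rows."""
    return sum(float(row.split("\t")[-1]) for row in rows)

# Two rows copied from the table above.
rows = [
    "LLM Agents OR Autonomous Agents OR Agentic Workflow\t0.0\t10.0/10\t0.0",
    "AI for Science OR Bioinformatics OR Cheminformatics\t0.0\t10.0/10\t0.0",
]
threshold = pass_score(27.0)
passed = total_score(rows) >= threshold
```

With a weight of 0.0 on every row, a relevance of 10.0/10 still contributes a score of 0.0, which is why this entry ends at 0.0 against a threshold of 26.6.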

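Scoring rationale: The paper addresses real-time synchronization of medical digital twins (MDTs) and proposes an Autonomous Agent-orchestrated Digital Twin (AADT) system built on the OpenClaw framework. The system uses the agents' proactive "heartbeat" mechanism and modular skills to monitor data streams continuously and run automated workflows. The paper is unrelated to most keywords, which concern the technical details of large models (training methods, optimization techniques, inference acceleration, and so on), whereas this work focuses on the application architecture of an agent system in medicine. It is relevant to only two keywords: 1. 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (relevance 10): the core of the paper is a workflow orchestrated by autonomous agents to synchronize digital twin state, which is an application of agent systems. 2. 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 10): the paper targets medical digital twins for rare genetic disorders, involving bioinformatics (genomic data, phenotype tracking) and AI in a scientific (medical) domain. No other keyword is mentioned or implied in the paper, so each is rated 0.

!!! tip deepseek-chat TL;DR

The paper addresses the synchronization problem that lagging data updates cause for medical digital twins in rare genetic disorders. It proposes an autonomous agent-orchestrated digital twin framework that keeps patient models synchronized in real time with dynamic clinical and genomic data, supporting earlier diagnosis and more accurate modeling of disease progression.

Abstract translation

Background: Medical digital twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. Most MDT systems, however, stay static or are only passively updated, which creates a critical synchronization gap, especially in rare genetic disorders, where phenotypes, genomic interpretations, and care guidelines keep evolving over time.
Methods: We propose an agent-orchestrated digital twin framework that uses OpenClaw's proactive "heartbeat" mechanism and modular agent skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (for example, patient-reported phenotype data and updates to variant classification databases) and runs automated workflows for data ingestion, normalization, state updates, and trigger-based analysis.
Results: A prototype implementation shows that agent orchestration can continuously synchronize MDT state with longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this supports earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, to highlight how AADT provides timely, auditable updates for research and clinical care.
Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs and enables scalable, continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design.

Abstract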

Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent-orchestrated digital twin framework using OpenClaw’s proactive “heartbeat” mechanism and modular Agent Skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient-reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger-based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design.

Keywords: Autonomous Agents, Digital Twins, Medical Digital Twins, Rare Genetic Disorders, State Synchronization, OpenClaw Framework, Agent Orchestration, Genomic Data Integration
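
The monitor-and-update cycle described in the abstract can be sketched as a single "heartbeat" tick. This is a minimal illustration under stated assumptions, not OpenClaw's API: the classes `Source` and `DigitalTwin`, the function `heartbeat`, and the toy variant feed are all hypothetical names introduced here, and the version check stands in for whatever change detection the real system uses.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Source:
    name: str
    poll: Callable[[], Tuple[Any, Any]]   # returns (version, payload)

@dataclass
class DigitalTwin:
    state: Dict[str, Any] = field(default_factory=dict)
    seen_versions: Dict[str, Any] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

def heartbeat(twin, sources, normalize, on_change):
    """One tick: poll every source, update the twin, and trigger analysis on change."""
    changed = []
    for src in sources:
        version, payload = src.poll()
        if twin.seen_versions.get(src.name) == version:
            continue                               # nothing new from this source
        twin.state[src.name] = normalize(payload)  # ingest, normalize, update state
        twin.seen_versions[src.name] = version
        twin.audit_log.append(f"{src.name}: synced to version {version}")
        changed.append(src.name)
    if changed:
        on_change(twin, changed)                   # trigger-based analysis hook
    return changed

# Toy data stream standing in for a variant classification database (hypothetical).
feed = {"version": 1, "payload": {"variant_x": "uncertain significance"}}
sources = [Source("variant_db", lambda: (feed["version"], dict(feed["payload"])))]
triggered = []
record = lambda twin, names: triggered.append(list(names))
twin = DigitalTwin()

first = heartbeat(twin, sources, lambda p: p, record)    # initial sync
second = heartbeat(twin, sources, lambda p: p, record)   # no change, no trigger
feed.update(version=2, payload={"variant_x": "likely pathogenic"})
third = heartbeat(twin, sources, lambda p: p, record)    # reclassification picked up
```

In a deployed system the tick would be driven by a scheduler, and the `on_change` hook is a natural place for the human-in-the-loop review that the abstract mentions.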

317. ❌ Path Integral Methods in Atomistic Modelling: An Introduction

Authors: Michele Ceriotti, David E. Manolopoulos, Thomas E. Markland, Mariana Rossi Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

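Scoring rationale: "Path Integral Methods in Atomistic Modelling: An Introduction" is a textbook on path integral methods and their application to atomistic modelling. It focuses on traditional simulation techniques in physical chemistry and computational physics, such as path integral molecular dynamics and quantum statistical mechanics. All scoring keywords concern large language models, deep learning, and AI techniques and their applications, none of which the paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces the basic theory of path integral methods and recent simulation techniques for modelling atomistic processes, offering researchers in computational physics and chemistry a self-contained learning resource.

Abstract translation

This book gives a systematic introduction to path integral methods and their application to modelling atomistic processes. It covers both the foundational theory and recently developed simulation techniques. The text is self-contained and was originally compiled as teaching material for the schools on path integral methods held by CECAM (the European centre for atomic and molecular computation).

Abstract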

This book provides an introduction to path integral methods and their application to modeling atomistic processes. The book covers both the foundational theory and recently developed simulation techniques. The text provides a self-contained resource and was originally developed for the CECAM schools on Path Integral Methods.

Keywords: Path Integral Methods, Atomistic Modelling, Simulation Techniques, Quantum Statistical Mechanics, Molecular Dynamics, Computational Physics, CECAM Schools

318. ❌ Hunting for quantum advantage in electronic structure calculations is a highly non-trivial task

Authors: Örs Legeza, Andor Menczer, Miklós Antal Werner, Sotiris S. Xantheas, Frank Neese, Martin Ganahl, Cole Brower, Samuel Rodriguez Bernabeu, Jeff Hammond, John Gunnels Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

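Scoring rationale: The paper focuses on electronic structure calculations in quantum chemistry, using the DMRG method on classical hardware (GPUs) to provide benchmark data for assessing quantum advantage. Its content is unrelated to the vast majority of keywords (large models, deep learning techniques, training methods, inference optimization, agents, and so on), which belong to artificial intelligence and machine learning, whereas this paper belongs to computational chemistry and quantum computing. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper involves scientific computing (quantum chemistry), a potential application setting for AI in science, but it uses conventional numerical simulation rather than AI or machine learning, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The study runs electronic structure calculations with a mixed-precision, spin-adapted ab initio density matrix renormalization group (DMRG) method on the NVIDIA Blackwell GPU platform, providing high-accuracy ground-state energy benchmarks for the Fe4S4 and Fe5S12H4^5- molecular systems as a classical reference for assessing quantum advantage.

Abstract translation

Given the major advances of the past decades in both quantum computing and simulation on classical hardware, identifying a practical problem that can be expected to show quantum advantage has become a serious challenge. In quantum chemistry, electronic structure calculations for strongly correlated systems (that is, multi-reference problems) are often argued to belong to this class, because standard methods based on mean-field theory struggle with them. Decisive conclusions when comparing these competing directions therefore require state-of-the-art benchmark data from classical algorithms. We report cutting-edge performance results and a high-accuracy ground-state energy for the Fe$_4$S$_4$ molecular cluster in a CAS(54,36) model space, a system recently added to the "Quantum Advantage Tracker" webpage maintained by IBM and RIKEN. Pushing the limits further, we also carry out CAS-SCF based orbital optimizations for the Fe$_5$S$_{12}$H$_4^{5-}$ molecular system, whose sextet ground state contains 25 open-shell orbitals, with active spaces of up to 89 electrons in 102 orbitals, CAS(89,102), and a very large active space of 331 electrons in 451 orbitals. These results were obtained with mixed-precision, spin-adapted ab initio density matrix renormalization group (DMRG) electronic structure calculations interfaced with the ORCA program package and run on the NVIDIA Blackwell graphics processing unit (GPU) platform. We argue that DMRG benchmark data should serve as the classical reference when quantum advantage is reported. Full exploitation of classical hardware should also be taken into account, because even the most advanced DMRG implementations are still at an early stage in using the benefits of GPU technology.

Abstract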

In light of major developments over the past decades in both quantum computing and simulations on classical hardware, it is a serious challenge to identify a real-world problem where quantum advantage is expected to appear. In quantum chemistry, electronic structure calculations of strongly correlated, i.e. multi-reference problems, are often argued to fall into such a category because of their intractability with standard methods based on mean-field theory. Therefore, providing state-of-the-art benchmark data by classical algorithms is necessary to make a decisive conclusion when such competing development directions are compared. We report cutting-edge performance results together with high accuracy ground state energy for the Fe$_4$S$_4$ molecular cluster on a CAS(54,36) model space, a problem that has been included quite recently among the list of systems in the Quantum Advantage Tracker webpage maintained by IBM and RIKEN. Pushing the limits even further, we also present CAS-SCF based orbital optimizations for unprecedented CAS sizes of up to 89 electrons in 102 orbitals [CAS(89,102)] for the Fe$_5$S$_{12}$H$_4^{5-}$ molecular system comprising twenty-five open shell orbitals in its sextet ground state and an active space size of 331 electrons in 451 orbitals. We have achieved our results via mixed-precision spin-adapted ab initio Density Matrix Renormalization Group (DMRG) electronic structure calculations interfaced with the ORCA program package and utilizing the NVIDIA Blackwell graphics processing unit (GPU) platform. We argue that DMRG benchmark data should be taken as a classical reference when quantum advantage is reported. In addition, full exploitation of classical hardware should also be considered since even the most advanced DMRG implementations are still in a premature stage regarding utilization of all the benefits of GPU technology.

Keywords: quantum advantage, electronic structure calculations, Density Matrix Renormalization Group (DMRG), Fe4S4 molecular cluster, Fe5S12H4^5-, CAS-SCF, GPU acceleration, benchmark data
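
To give a feel for the active-space sizes quoted in the abstract, the sketch below counts Slater determinants for a CAS(N, M) space at fixed spin projection $S_z = S$. This is a rough size measure chosen here for illustration, not the cost model of a DMRG calculation. The function name `n_determinants` is hypothetical. The sextet (2S = 5) for CAS(89,102) comes from the abstract, while the singlet (2S = 0) used for CAS(54,36) is an assumption, since the abstract does not state the spin state of Fe$_4$S$_4$.

```python
from math import comb

def n_determinants(n_electrons, n_orbitals, two_s):
    """Number of Slater determinants with S_z = S in a CAS(n_electrons, n_orbitals)."""
    if (n_electrons + two_s) % 2:
        raise ValueError("electron count and 2S must have the same parity")
    n_alpha = (n_electrons + two_s) // 2     # spin-up electrons
    n_beta = n_electrons - n_alpha           # spin-down electrons
    # Choose occupied spatial orbitals independently for each spin
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)

dim_fe4s4 = n_determinants(54, 36, two_s=0)    # 2S = 0 is an assumption
dim_sextet = n_determinants(89, 102, two_s=5)  # sextet: 2S + 1 = 6
```

Under the singlet assumption the CAS(54,36) count is already on the order of $10^{16}$, which illustrates why tensor-network methods such as DMRG are used instead of storing the full wavefunction.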

319. ❌ A reduced-cost two-component relativistic equation-of-motion coupled cluster method for the double electron attachment problem

Authors: Sujan Mandal, Tamoghna Mukhopadhyay, Achintya Kumar Dutta Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28441v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

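Scoring rationale: The paper is a theoretical study of a relativistic equation-of-motion coupled-cluster method in computational chemistry, focused on improving computational efficiency for the double electron attachment problem. Its content lies entirely within computational and quantum chemistry and has no direct connection to any of the keywords on large models, deep learning, or AI techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work belongs to scientific computing, but the paper uses conventional quantum chemistry methods rather than AI or machine learning, so it receives a relevance of only 5 (some connection).

!!! tip deepseek-chat TL;DR

The paper proposes a reduced-cost relativistic equation-of-motion coupled-cluster method for the double electron attachment problem. By introducing a state-specific frozen natural spinor basis and Cholesky decomposition, it substantially lowers the memory requirements and computational cost of calculations on heavy elements with large basis sets.

Abstract translation

We present a computationally efficient formulation of the relativistic equation-of-motion coupled-cluster method for the double electron attachment problem. The work uses the exact two-component Hamiltonian within the atomic mean-field approximation, and the results agree closely with the corresponding four-component calculations. For heavy elements and large basis sets, however, canonical DEA-EOM-CCSD calculations become extremely expensive because of the large memory cost of the complex 3p1h (three-particle one-hole) excitation manifold. To overcome this limitation, we introduce a state-specific frozen natural spinor basis that substantially compresses the virtual space through two controllable truncation thresholds. Cholesky decomposition of the two-electron integrals further reduces the computational cost and memory requirements. The performance of the method is validated with calculations of double ionization potentials and excitation energies for heavy group-12 and group-14 elements. Vertical excitation energies for heavy chalcogen dimers are also reported, and a range of diatomic spectroscopic constants is evaluated for group-13 halides.

Abstract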

We present a computationally efficient relativistic formulation of the equation-of-motion coupled-cluster method for the double electron attachment problem. In this work, the exact two-component Hamiltonian within the atomic mean-field approximation is employed, yielding results that are in close agreement with the corresponding four-component calculations. However, canonical DEA-EOM-CCSD calculations become prohibitively expensive for heavy elements and large basis sets due to the substantial memory requirements associated with the complex 3p1h excitation manifold. To address this limitation, we introduce a state-specific frozen natural spinor basis that significantly reduces the virtual space through two controllable truncation thresholds. Furthermore, the use of Cholesky decomposition for the two-electron integrals provides an additional reduction in computational cost and memory. The performance of the proposed approach is demonstrated through calculations of double ionization potentials and excitation energies for group-12 and group-14 heavy elements. Vertical excitation energies for heavy chalcogen dimers are also presented. In addition, a range of diatomic spectroscopic constants is evaluated for group-13 halides.

Keywords: relativistic equation-of-motion coupled-cluster, double electron attachment, computational efficiency, frozen natural spinor basis, Cholesky decomposition, heavy elements, double ionization potentials, excitation energies
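
The memory saving from Cholesky decomposition mentioned in the abstract rests on replacing a large positive semidefinite matrix with a thin factor. The sketch below shows a generic pivoted (incomplete) Cholesky decomposition with a threshold, written for a real symmetric matrix. It is an illustration under assumptions, not the authors' implementation: the function name `pivoted_cholesky`, the tolerance, and the random test matrix are hypothetical, and relativistic spinor integrals would be complex-valued.

```python
import numpy as np

def pivoted_cholesky(M, tol=1e-8):
    """Low-rank factor L with M ≈ L @ L.T for a symmetric positive semidefinite M."""
    n = M.shape[0]
    diag = np.diag(M).astype(float).copy()   # diagonal of the current residual
    vectors = []
    while len(vectors) < n:
        p = int(np.argmax(diag))             # pivot on the largest residual diagonal
        if diag[p] <= tol:
            break                            # residual diagonal is at most tol everywhere
        col = M[:, p].astype(float).copy()
        for v in vectors:                    # subtract contributions of earlier vectors
            col -= v * v[p]
        v = col / np.sqrt(diag[p])
        vectors.append(v)
        diag -= v * v
    return np.column_stack(vectors) if vectors else np.zeros((n, 0))

# Synthetic rank-4 test matrix, NOT two-electron integrals.
rng = np.random.default_rng(1)
B = rng.standard_normal((40, 4))
M = B @ B.T
L = pivoted_cholesky(M)
```

Here a 40 × 40 matrix is stored as a 40 × 4 factor. The threshold plays the same role as a decomposition tolerance in integral codes: a looser value gives fewer vectors and a larger reconstruction error.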

320. ❌ Hybrid QPE-Ansatz Strategy for Reliable Excited-State Variational Quantum Deflation

Authors: Young Kyun Ahn, Young Min Rhee Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.27978v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

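Scoring rationale: The paper is about quantum computing, specifically a variational quantum algorithm for excited-state calculations (sfVQD) involving quantum phase estimation, spin-symmetry preservation, and compatibility with NISQ devices. All keywords relate directly to large language models, the principles of deep learning techniques, or their scientific applications, whereas this paper studies quantum algorithms, an entirely different technical field. The only possible link is "AI for Science", since quantum computing can be regarded as a method of scientific computing, but the paper uses no AI or machine learning and is purely quantum algorithm design, so it receives a relevance of 5 (some connection). All other keywords are unrelated to the paper and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes a spin-filtering variational quantum deflation (sfVQD) scheme that combines a spin-symmetry-preserving ansatz with shallow quantum phase estimation for excited-state calculations in the NISQ era, and demonstrates better singlet-triplet separation than conventional VQD on the LiH and BeH2 molecules.

Abstract translation

We propose a symmetry-preserving ansatz that conserves the spin $z$-component ($S_{z}$) and a shallow quantum phase estimation (QPE) routine for the spin $x$-component ($S_x$), and combine them into spin-filtering variational quantum deflation (sfVQD), an excited-state calculation scheme suited to the noisy intermediate-scale quantum (NISQ) era. The scheme encodes spin information into a small ancilla register through controlled rotations $\exp(i\theta\hat{S}_{x})$ at only modest circuit overhead. The encoded information is then used to suppress spin contamination by screening, avoiding the costly explicit evaluation of the total spin $\langle\hat{S}^{2}\rangle$. Because the screening module runs independently of the variational ansatz, it can also be combined with other excited-state schemes based on variational quantum eigensolvers. As a demonstration, we apply sfVQD to LiH and BeH$_2$ at different geometries; compared with conventional VQD without QPE-derived screening, the separation of singlet and triplet levels improves markedly. These results indicate that ancilla-assisted symmetry screening offers a modular, NISQ-compatible route to securing physically meaningful excited-state properties. We also discuss how the scheme extends naturally to other conserved quantities.

Abstract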

We introduce a spin $z$-component ($S_{z}$) conserving symmetry-preserving ansatz and a shallow quantum phase estimation (QPE) routine of spin $x$ ($S_x$), and combine them into a spin-filtering variational quantum deflation (sfVQD) scheme for excited-state calculations in the noisy intermediate-scale quantum (NISQ) computing era. The scheme encodes the spin information into a small ancilla register through controlled rotations under $\exp(i\theta\hat{S}_{x})$ with only modest circuit overhead. The encoded information is then utilized to suppress spin contamination by screening, avoiding costly explicit evaluation of the total spin $\langle\hat{S}^{2}\rangle$. Because the screening module operates independently of the variational ansatz, it can also be employed with other excited-state calculation schemes based on variational quantum eigensolvers. As a demonstration, we apply sfVQD to LiH and BeH$_2$ with varying geometries to show markedly improved separation of singlet and triplet manifolds over conventional VQD without QPE-derived screening. These results suggest that ancilla-assisted symmetry screening provides a modular and NISQ-compatible route to securing excited state calculations of physically meaningful properties. We discuss how our scheme may naturally be extended to computing other conserved quantities.

Keywords: variational quantum deflation, excited-state calculation, quantum phase estimation, spin symmetry, NISQ computing, ansatz design, molecular systems, ancilla-assisted screening
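
The "deflation" idea underlying VQD can be shown with ordinary linear algebra. In its usual formulation, VQD finds an excited state by minimizing the energy plus a penalty on the overlap with states already found. The sketch below is a classical stand-in for that idea and not the paper's quantum circuit or its spin filtering: it adds a penalty projector to a small synthetic matrix so that the former first excited state becomes the lowest eigenstate. The function name `deflate`, the penalty weight `beta`, and the random matrix are assumptions made for illustration.

```python
import numpy as np

def deflate(H, states, beta):
    """Add beta * |psi><psi| for each known state to push it up in energy."""
    H_eff = H.copy()
    for psi in states:
        H_eff = H_eff + beta * np.outer(psi, psi.conj())
    return H_eff

# Synthetic real symmetric "Hamiltonian", NOT a molecular Hamiltonian.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2
energies, vecs = np.linalg.eigh(H)
ground = vecs[:, 0]

beta = energies[-1] - energies[0] + 1.0   # larger than any excitation gap
lowest_deflated = np.linalg.eigh(deflate(H, [ground], beta))[0][0]
```

The penalty must exceed the gap to the target state, otherwise the ground state remains the minimum. A penalty of this kind only targets the states it is given, which is consistent with the abstract's use of a separate screening step for spin contamination.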

321. ❌ Enhancing Spin Coherence of Optically-Addressed Molecular Qubit by Nuclear Spin Hyperpolarization

Authors: Boning Li, Patrick Hautle, Duhan Zhang, Liangping Zhu, Paola Cappellaro, Tom Wenckebach, Yifan Quan Journal/Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

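Scoring rationale: The paper studies the enhancement of spin coherence in molecular qubits, which belongs to experimental physics and quantum information science and is unrelated to the vast majority of keywords (large models, deep learning, principles of AI techniques). Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', has some connection, because the study is a scientific application (quantum technology), but the paper uses no AI methods, so it is given a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The paper studies how nuclear spin hyperpolarization can suppress the decoherence that the nuclear spin bath causes in molecular qubits. In the experiments, the spin-echo decay time increased by 25% when the proton spin bath reached 60% polarization.

Abstract translation

Optically addressable molecular triplet spins offer a chemically tunable platform for quantum applications, but their coherence is often limited by interactions with the surrounding spin bath. Here we demonstrate controlled suppression of nuclear-bath-induced decoherence in the photoexcited triplet spins of pentacene co-crystallized in high-purity naphthalene single crystals. Hyperpolarizing the proton spin bath by triplet dynamic nuclear polarization suppresses the magnetic noise generated by the nuclear spins and so extends the transverse coherence time of the electron spin. Experimentally, we observe a 25% increase in the spin-echo decay time when the proton spin bath reaches 60% polarization. The measured dependence of the spin-echo decay time ($T_2$) on nuclear polarization agrees quantitatively with the theoretical prediction derived from the polarization-controlled nuclear second moment. Both the size of the enhancement and the absolute coherence time are reproduced quantitatively by cluster correlation expansion (CCE) simulations. These results establish nuclear spin hyperpolarization as a general, actively tunable method for engineering the coherence of molecular qubits. The work provides a broadly applicable design framework for high-coherence molecular and solid-state spin systems.

Abstract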

Optically addressable molecular triplet spins provide a chemically tunable platform for quantum application, but their coherence is often limited by interactions with surrounding spin baths. Here we demonstrate controlled suppression of nuclear-bath-induced decoherence in photoexcited triplet spins of pentacene co-crystallized in high-purity naphthalene single crystals. By hyperpolarizing the proton spin bath through triplet dynamic nuclear polarization (triplet-DNP), magnetic noise generated by the nuclear spins is suppressed, leading to an extension of the electron spin transverse coherence time. Experimentally, we observe a 25% enhancement of the spin-echo decay time with 60% polarization of the proton spin bath. The measured scaling of the spin-echo decay time ($T_2$) with nuclear polarization quantitatively follows the predicted dependence derived from the polarization-controlled nuclear second moment. Both the enhancement and the absolute value of the coherence time are quantitatively reproduced by cluster correlation expansion (CCE) simulations. These results establish nuclear spin hyperpolarization as a general and actively tunable approach to engineering coherence in molecular qubits. This work provides a broadly applicable design framework for high-coherence molecular and solid-state spin systems.

Keywords: molecular qubit, spin coherence, nuclear spin hyperpolarization, triplet-DNP, decoherence suppression, quantum application, spin-echo decay time, cluster correlation expansion
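
The abstract says that $T_2$ follows the polarization-controlled nuclear second moment but does not give the formula. A minimal sketch, assuming a common scaling for a spin-1/2 bath in which the second moment $M_2$ is proportional to $1 - P^2$ and $T_2$ is proportional to $M_2^{-1/2}$ (both are assumptions made here, not taken from the paper; the function name `t2_enhancement` is hypothetical):

```python
from math import sqrt

def t2_enhancement(polarization):
    """Relative gain T2(P) / T2(0), assuming M2 ∝ (1 - P**2) and T2 ∝ M2**-0.5."""
    if not 0.0 <= abs(polarization) < 1.0:
        raise ValueError("polarization must satisfy |P| < 1")
    return 1.0 / sqrt(1.0 - polarization ** 2)

gain_at_60_percent = t2_enhancement(0.60)
```

Under these assumptions, $P = 0.6$ gives $1/\sqrt{1 - 0.36} = 1.25$, which matches the 25% enhancement at 60% polarization reported in the abstract.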

322. ❌ Understanding the Density Maximum of Water with Machine Learned Potentials

Authors: Yizhi Song, Renxi Liu, Chunyi Zhang, Yifan Li, Biswajit Santra, Mohan Chen, Michael L. Klein, Xifan Wu Journal/Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

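Scoring rationale: The paper studies the molecular mechanism of the water density anomaly using molecular dynamics simulations with a machine-learned potential, an application of AI in science. Among the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is relevant: the paper falls under AI for Science, though it is not a specific bioinformatics or cheminformatics application. The other keywords concern the principles of large models, training methods, inference optimization, alignment, agent systems, and so on, and are unrelated to the paper's molecular simulation work.

!!! tip deepseek-chat TL;DR

Using molecular dynamics simulations with a machine-learned potential, the study shows that water's density maximum at 4 °C arises from a subtle mechanism in which ideal tetrahedral coordination is retained at short range while the intermediate-range structure collapses, rather than from the traditional picture of a mixture of ordered and disordered structures.

Abstract translation

After melting at ambient pressure, the density of water keeps increasing with temperature until it reaches a maximum at about 4 °C. For nearly a century this phenomenon has been attributed qualitatively to a mixture of ordered and disordered structures. Here we use a deep neural network to train a machine-learned (ML) interatomic potential on electronic structure data from advanced density functional theory. Notably, molecular dynamics simulations with this ML potential reproduce both the experimentally observed density anomaly of water and its thermal expansion coefficient. Detailed structural analysis of the computed hydrogen-bond network shows that the density anomaly originates in an emergent liquid structure that keeps nearly ideal tetrahedral coordination at short range but collapses at intermediate range. Our study indicates that the mechanism behind the density maximum is more subtle than the conventional picture and underlines the collective role of structural order at different length scales.

Abstract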

After melting, at ambient pressure, the density of water continues to increase with temperature until it reaches a maximum around 4 °C. For nearly a century, this phenomenon has been qualitatively attributed to a mixture of ordered and disordered structures. Herein, we employ a deep neural network to train a machine learned (ML) interatomic potential for water using electronic structure data from advanced density functional theory. Notably, molecular dynamics simulations with the ML potential reproduce both the experimental water density anomaly and the thermal expansion coefficient. Detailed structural analysis of the computed hydrogen-bond network reveals that the density anomaly arises from an emergent liquid structure that retains nearly ideal tetrahedral coordination at short range but collapses at intermediate range. Our findings point to a more delicate mechanism causing the density maximum than the conventional picture, emphasizing the collective roles of structural orderings at different length scales.

Keywords: water density anomaly, machine learned potential, molecular dynamics simulation, hydrogen-bond network, tetrahedral coordination, density functional theory, thermal expansion coefficient, structural analysis
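
The two quantities named in the abstract are linked by the definition $\alpha = -\frac{1}{\rho}\frac{\partial \rho}{\partial T}$: the density maximum sits where the thermal expansion coefficient changes sign. The sketch below shows how both could be extracted from a density-versus-temperature curve. The function names are hypothetical, and the curve is a synthetic parabola placed at 4 °C purely for illustration; it is not simulation or experimental data.

```python
import numpy as np

def thermal_expansion(T, rho):
    """alpha = -(1/rho) * d(rho)/dT, estimated by finite differences."""
    return -np.gradient(rho, T) / rho

def temperature_of_max_density(T, rho):
    """Temperature where alpha crosses zero from negative to positive, or None."""
    alpha = thermal_expansion(T, rho)
    for i in range(len(T) - 1):
        if alpha[i] < 0.0 <= alpha[i + 1]:
            frac = -alpha[i] / (alpha[i + 1] - alpha[i])   # linear interpolation
            return float(T[i] + frac * (T[i + 1] - T[i]))
    return None

# Synthetic parabola peaked at 4 °C; NOT simulation or experimental data.
T = np.linspace(0.25, 9.75, 20)
rho = 1.0 - 1e-5 * (T - 4.0) ** 2
tmd = temperature_of_max_density(T, rho)
```

With simulation output, `rho` would be the average density from constant-pressure runs at each temperature, and statistical noise would usually call for a smooth fit before differentiating.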

323. ❌ The effects of ionic valency and size asymmetry on counterion adsorption

Authors: Or Ben Yaakov, Haim Diamant, Rudolf Podgornik, David Andelman Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27444v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

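Scoring rationale: This paper studies the equilibrium properties of multivalent ionic solutions near a charged surface. It belongs to physical chemistry and mainly involves ion adsorption, size-asymmetry effects, and classical theory such as the Poisson-Boltzmann equation. All scoring keywords relate directly to large models, deep learning, AI techniques, or AI for Science, whereas this paper involves no artificial intelligence, machine learning, or computational modeling techniques, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This study examines how ionic valency and size asymmetry affect the equilibrium properties of multivalent ionic solutions near a charged surface. It finds that high surface charge and large ionic size cause the near-surface concentration to saturate, and that layers ordered by the ions' valency-to-size ratio may form.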

Abstract

We study the effect of asymmetry in solvent and ionic size on the equilibrium properties of multivalent ionic solutions near a charged surface. For a single ionic species in solution, we derive a generalized Grahame equation at the charged surface. For general size ratio between the ions and the solvent, we obtain analytical results for the concentration profiles as a function of the distance from the surface. For weak surface charge and small ion-to-solvent size ratio, the profile follows the classical Poisson-Boltzmann equation in dilute solution conditions. However, for high surface charge and large ionic size, the concentration profile saturates near the surface, leading to distinctive dependencies of the solution properties on the surface charge density and size asymmetry. Furthermore, the crossover between dilute and saturated regimes depends on the surface charge and ionic size asymmetry. We suggest that a solution containing multiple ionic species of different valencies and sizes stratifies close to the surface in the saturation regime. This leads to the formation of layers that are ordered according to the ions’ valency-to-size ratio.

Keywords: ionic solutions, charged surface, size asymmetry, counterion adsorption, Poisson-Boltzmann equation, multivalent ions, concentration profiles, surface charge density

324. ❌ Temperature dependence of the dynamic structure factor of the electron liquid via analytic continuation

Authors: Thomas Chuna, Maximilian P. Böhme, Tobias Dornheim Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27212v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

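Scoring rationale: This paper studies the dynamic structure factor of the electron liquid using path integral Monte Carlo methods and analytic continuation. It belongs to condensed matter physics and computational physics. All scoring keywords concern large models, deep learning, and related techniques, none of which this paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper uses path integral Monte Carlo data and analytic continuation to study the dynamic structure factor of the uniform electron liquid at different temperatures, providing an improved basis for X-ray Thomson scattering experiments and density functional theory.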

Abstract

We present new analytic continuation results for the dynamic structure factor $S(\mathbf{q},\omega)$ of the uniform electron liquid based on quasi-exact ab initio path integral Monte Carlo (PIMC) data for the imaginary-time density–density correlation function $F(\mathbf{q},\tau)$ across a broad range of temperatures. For this purpose, we employ both a traditional maximum entropy method solver, and a pre-optimized sparse Gaussian kernel representation as it has been implemented in the recent PyLIT package [Benedix Robles et al., Comp. Phys. Comm. 319, 109904 (2026)], and we identify potential advantages and disadvantages in both. We expect our results to be interesting for a broad range of topics, including the interpretation of x-ray Thomson scattering experiments with extreme states of matter and the construction of improved exchange–correlation kernels for linear-response time-dependent density functional theory.

Keywords: dynamic structure factor, electron liquid, analytic continuation, path integral Monte Carlo, density-density correlation function, temperature dependence, X-ray Thomson scattering, density functional theory

325. ❌ ADEPT-PolyGraphMT: Automated Molecular Simulation and Multi-Task Multi-Fidelity Machine Learning for Polymer Property Generation and Prediction

Authors: Sobin Alosious, Yuhan Liu, Jiaxin Xu, Gang Liu, Renzheng Zhang, Meng Jiang, Tengfei Luo Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

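Scoring rationale: This paper focuses on polymer informatics, combining molecular dynamics simulations, density functional theory calculations, and multi-task, multi-fidelity machine learning (using a graph neural network) to predict polymer properties. Its content is unrelated to most keywords (large language models, training techniques, inference optimization, agents, and so on), because those keywords mainly target large-model techniques in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (specifically materials science and cheminformatics) and overlaps conceptually with bioinformatics or cheminformatics, so it receives a relevance of 10 (highly relevant, core content).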
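
The table above lists a relevance of 10.0/10 for the last keyword alongside a weight of 0.0 and a score of 0.0. The arithmetic can be sketched as follows, assuming each keyword's score is weight × (relevance / 10) × the per-keyword maximum of 15, and using the pass mark 5.0 + 0.8 × total weight from the report configuration. The per-keyword formula is an assumption, not something the report states, and the names `keyword_score` and `passing_score` are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed formula: weight x (relevance / 10) x per-keyword maximum."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def passing_score(total_weight: float) -> float:
    """Pass mark from the configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


threshold = passing_score(27.0)           # 27 keywords, weight 1.0 each
as_listed = keyword_score(0.0, 10.0)      # weight 0.0 as printed in the table row
as_configured = keyword_score(1.0, 10.0)  # weight 1.0 as in the configuration
print(round(threshold, 1), as_listed, as_configured)
```

Under these assumptions, a weight of 0.0 gives a score of 0.0 regardless of relevance, and even the configured weight of 1.0 would give 15.0, which is still below the pass mark of 26.6.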

!!! tip deepseek-chat TL;DR

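This study addresses the vast polymer design space and the scarcity of high-quality multi-property data. By integrating automated molecular simulation with a multi-task, multi-fidelity machine learning framework, it builds a dataset of approximately 62,000 polymer property values, shows that multi-task models outperform single-task models when data are limited, and enables large-scale prediction and screening of polymer properties.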

Abstract

The discovery of polymers with targeted properties is challenged by the vast chemical design space and the limited availability of consistent, high-quality data across multiple properties. In this work, an integrated polymer informatics framework is presented that combines the Automated molecular Dynamics Engine for Polymer simulaTions (ADEPT) workflow with multi-task and multi-fidelity machine learning (PolyGraphMT). Polymer repeat units are represented as molecular graphs and processed using a graph neural network to learn structure-property relationships. Starting from SMILES representations for monomers, ADEPT automates the construction of atomistic models and the evaluation of their properties using molecular dynamics simulations and density functional theory calculations. The simulation data are combined with curated experimental data and group contribution theory estimates to construct a unified dataset of approximately 62,000 polymer property values spanning 28 properties. Using this dataset, inter-property correlations are analyzed, and multi-task learning strategies are evaluated for joint property prediction. The results show that multi-task models achieve performance comparable to single-task models in data-rich regimes and exhibit superior accuracy as training data become limited. In addition, fidelity-aware training improves predictive accuracy when combining experimental and computational data sources. The trained models are further applied to large-scale property prediction for polymers in the PolyInfo database and the PI1M virtual polymer library, producing physically consistent property distributions across a broad chemical space. Overall, the proposed framework provides a structured approach for scalable prediction and screening of polymer properties across multiple property types and data fidelity levels.

Keywords: polymer informatics, molecular dynamics simulations, multi-task learning, multi-fidelity machine learning, graph neural network, property prediction, PolyInfo database, PI1M virtual polymer library

326. ❌ A theoretical and experimental assessment of adiabatic losses in force-gradient-detected magnetic resonance of nitroxide spin labels

Authors: Michael C. Boucher, Peter Sun, Eric W. Moore, John A. Marohn Journal/Source: arXiv Published: 2026-03-28 arXiv link: http://arxiv.org/abs/2603.27087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

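Scoring rationale: This paper covers the theoretical modeling and experimental validation of adiabatic losses in electron-spin signals in magnetic resonance force microscopy (MRFM). It is experimental physics research in the area of physical and chemical instrumentation. Its content involves Landau-Zener-Stückelberg-Majorana (LZSM) transition theory, spin relaxation, magnetic resonance imaging, and related topics, and is unrelated to all scoring keywords (which center on large models, deep learning, AI techniques, and their applications), with no overlap or indirect connection.

!!! tip deepseek-chat TL;DR

Using an improved theory of LZSM transitions, this paper studies adiabatic signal loss for electron spins in magnetic resonance force microscopy, derives frequency-shift equations and validates them against experimental data, and proposes a new experimental protocol that eliminates spurious signals.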

Abstract

We recently introduced a new theoretical description of Landau–Zener–Stückelberg–Majorana (LZSM) transitions that accounts for both adiabatic and spin-dephasing losses during sweeps through resonance. Here, we use this new description to assess signal loss due to cantilever tip motion in magnetic resonance force microscopy experiments on electron spins. We derive equations for spin-induced cantilever frequency shifts that account for the time-dependent magnetization present during cantilever-synchronized periods of irradiation and relaxation. We show that a frequency shift can be created by either a force- or force-gradient coupling mechanism, depending on the periodicity and timing of the microwave irradiation; the frequency shift decreases when the spin-lattice relaxation time becomes shorter than the cantilever oscillation period. Equations were validated by comparing with the magnetization computed by numerically integrating the time-dependent Bloch equations. Numerical simulations incorporating the new equations were compared to frequency-shift electron-spin signals collected as a function of magnetic field, tip-sample separation, microwave power, and microwave timing. The simulations quantitatively describe the observed signals with essentially no free parameters. Finally, motivated by our new frequency-shift equations, we present a new experimental spin-excitation protocol that eliminates spurious signals arising from direct microwave excitation of the cantilever in a magnetic resonance force microscope experiment.

Keywords: magnetic resonance force microscopy, Landau-Zener-Stückelberg-Majorana transitions, adiabatic losses, electron spins, frequency shift, spin-lattice relaxation, cantilever tip motion, numerical simulations

Token usage statistics

Total: 956,406 tokens (input 634,902 / output 321,504)

Model	Input	Output	Total
deepseek-chat	581,472	321,504	902,976
glm-4.7	53,430	0	53,430
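
The reported totals can be cross-checked against the per-model rows. A minimal sketch, with the counts copied from the table; the dictionary name `usage` is illustrative only.

```python
# Per-model token counts copied from the table: (input, output)
usage = {
    "deepseek-chat": (581_472, 321_504),
    "glm-4.7": (53_430, 0),
}

total_input = sum(inp for inp, _ in usage.values())
total_output = sum(out for _, out in usage.values())
total = total_input + total_output
print(total_input, total_output, total)
```

The sums come to 634,902 input tokens, 321,504 output tokens, and 956,406 tokens overall, which matches the summary line.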