📊 ArXiv Research Report (2026-03-26)

Generated: 2026-03-26 09:45:47 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
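
As a quick check of the formula above, the following minimal sketch recomputes the passing score from the total keyword weight. The function name `passing_score` and its parameter names are illustrative assumptions, not part of the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing threshold as stated in the report: base + slope * total weight."""
    return base + slope * total_weight


# 27 keywords, each with weight 1.0, give a total weight of 27.0.
threshold = passing_score(27.0)
print(round(threshold, 1))
```

With a total weight of 27.0 this reproduces the 26.6 threshold listed in the settings.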

📈 Paper statistics

Total fetched: 304 papers
Passing papers: 7 (2.3%)
In-depth analysis: 7 papers

⭐ Detailed analysis of passing papers

1. SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23483v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
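
The per-keyword rows above appear to combine into the paper's total by simple summation. The sketch below assumes each row scores weight × relevance (on the 0-10 scale), capped at the per-keyword maximum of 15 from the scoring settings. That rule is an inference from the table, and the function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap listed in the scoring settings


def keyword_score(weight: float, relevance: float, cap: float = MAX_PER_KEYWORD) -> float:
    # Assumption: row score = weight * relevance, never exceeding the cap.
    return min(weight * relevance, cap)


def total_score(rows) -> float:
    # rows is an iterable of (weight, relevance) pairs.
    return sum(keyword_score(w, r) for w, r in rows)


# Non-zero rows of the SpecEyes table: LLMs, SLMs, CoT, System 2,
# Self-Correction, LLM Agents, Tool Use, Speculative Decoding.
spec_eyes_rows = [
    (1.0, 10.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0),
    (1.0, 5.0), (1.0, 10.0), (1.0, 10.0), (1.0, 10.0),
]
total = total_score(spec_eyes_rows)
print(total, total >= 26.6)
```

Under these assumptions the rows sum to the 65.0 reported for this paper, which clears the 26.6 passing score.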

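Scoring rationale: The paper proposes the SpecEyes framework, which accelerates the agentic workflow of multimodal large language models (MLLMs) through speculative perception and planning. Its core idea is to use a lightweight, tool-free small model as a speculative planner that predicts the execution trajectory, allowing expensive tool chains to be terminated early, with a cognitive gating mechanism for self-verification. The work directly involves LLMs, SLMs, LLM Agents, Tool Use, and Speculative Decoding/Inference Acceleration, which are its core techniques. Chain of Thought, System 2 Thinking, and Self-Correction/Improvement/Reflection relate to the reasoning, planning, and self-verification described in the paper but are not its main focus. Other keywords such as MoE, Scaling Laws, Pre-training, Alignment, RAG, and Quantization are not mentioned in the abstract or are unrelated to the paper's topic.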

!!! tip deepseek-chat TL;DR

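The paper addresses the sequential overhead and latency caused by the perception, reasoning, and tool-calling loops in agentic multimodal large language models (MLLMs). It proposes the SpecEyes framework, which combines speculative planning by a lightweight small model with a cognitive gating mechanism to achieve a 1.1-3.35x speedup while preserving or improving accuracy, and raises system throughput.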

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model’s confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

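Keywords: Agentic Multimodal LLMs, Speculative Acceleration, Tool-free MLLM, Cognitive Gating, Heterogeneous Parallel Funnel, System Throughput, Visual Tool Invocation, Agentic Depth

In-depth analysis:

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Summary:

To address the sequential overhead and latency that agentic multimodal large language models incur during iterative visual tool invocation, the paper proposes the SpecEyes framework. It uses a lightweight, tool-free MLLM as a speculative planner that predicts the execution trajectory so that expensive tool chains can be terminated early. It also introduces a cognitive gating mechanism based on answer separability, which quantifies model confidence for self-verification without oracle labels. In addition, it designs a heterogeneous parallel funnel that uses the stateless concurrency of the small model to mask the stateful serial execution of the large model. Experiments show that SpecEyes achieves a 1.1-3.35x speedup while preserving or improving accuracy (up to +6.7%), which raises serving throughput under concurrent workloads.

Innovations:

Proposes SpecEyes, described as the first speculative acceleration framework for agentic multimodal large models, targeting the agentic depth bottleneck caused by sequential reasoning.
Introduces a cognitive gating mechanism based on answer separability that provides self-verification and confidence quantification without oracle labels, regulating the speculative process.
Designs a heterogeneous parallel funnel architecture that uses the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput.
Shows that a lightweight, tool-free model can act as a speculative planner that accelerates complex tool-calling workflows without sacrificing accuracy, and can even improve accuracy through early screening.

Method

!!! info

The paper combines system architecture design with experimental evaluation. First, it analyzes the sequential bottleneck of the perception, reasoning, and tool-calling loop in agentic MLLMs. Second, it builds the SpecEyes framework from three core components: a speculative planner (a lightweight model), cognitive gating (based on answer separability), and a heterogeneous parallel funnel. Finally, it runs extensive experiments on V* Bench, HR-Bench, and POPE, comparing against the baseline on inference speed, accuracy, and concurrent throughput.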
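
The gating idea can be illustrated with a minimal sketch: a small model proposes an answer, a confidence gate decides whether to accept it, and only low-confidence queries fall through to the expensive agentic pipeline. The paper's actual answer-separability measure is not given in this report, so the top-two probability margin used here is a stand-in. `separability`, `answer_with_gate`, the toy models, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def separability(answer_probs: Dict[str, float]) -> float:
    """Stand-in confidence: gap between the two most probable answers."""
    ranked = sorted(answer_probs.values(), reverse=True)
    if not ranked:
        return 0.0
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up


def answer_with_gate(
    query: str,
    small_model: Callable[[str], Dict[str, float]],
    agentic_model: Callable[[str], str],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (answer, path); path records which model produced the answer."""
    probs = small_model(query)
    if probs and separability(probs) >= threshold:
        # Confident speculation: skip the tool chain entirely.
        return max(probs, key=probs.get), "speculative"
    # Low confidence: fall back to the full agentic pipeline.
    return agentic_model(query), "agentic"


# Toy stand-ins for the two models (purely illustrative).
def toy_small(query: str) -> Dict[str, float]:
    if "easy" in query:
        return {"yes": 0.9, "no": 0.1}
    return {"yes": 0.55, "no": 0.45}


def toy_agentic(query: str) -> str:
    return "no"


print(answer_with_gate("easy question", toy_small, toy_agentic))
print(answer_with_gate("hard question", toy_small, toy_agentic))
```

The threshold trades speed for safety: a higher value sends more queries to the agentic path, while a lower value accepts more speculative answers.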

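Key results:

On V* Bench, HR-Bench, and POPE, achieves a 1.1x to 3.35x inference speedup over the agentic baseline.
Accuracy is preserved or improved alongside the speedup (up to +6.7%).
Under concurrent workloads, serving throughput improves, supporting the effectiveness of the heterogeneous parallel architecture.

Tech stack: agentic multimodal large language models, speculative execution, heterogeneous parallel computing, cognitive gating mechanism, answer separability metric, visual tool invocation

Strengths

Highly novel: presented as the first application of speculative execution to agent-level multimodal model inference, addressing a practical latency pain point.
The proposed cognitive gating mechanism does not depend on external labels, which makes it broadly applicable and adaptive.
Achieves a "win-win": it reduces latency and also improves accuracy by filtering out erroneous paths early.
The system architecture is well designed and maximizes hardware utilization through heterogeneous parallelism.

Limitations

The speculative planner's performance depends on the capability of the lightweight model. If the small model cannot predict trajectories accurately, the speedup may be limited or even negative.
Deploying and scheduling a heterogeneous parallel system may be complex, which increases engineering difficulty.
The paper focuses on visual tool invocation; generalization to other tool types (such as code interpreters or database queries) is not fully explored.

Relevance to research direction:

The paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning", focusing on the inference efficiency and system architecture of multimodal large models. Although it does not directly address specific applications in scientific domains, its acceleration technique can be applied to any scientific computing or analysis scenario that relies on agentic MLLMs, giving it high technical innovation value.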

2. Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23013v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

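Scoring rationale: The paper's core subject is a memory-augmented AI agent framework in which an 8B small model answers repeated queries by retrieving conversational memory, substantially improving efficiency. Highly relevant keywords: LLMs (uses Qwen3-8B/235B models), SLMs (the 8B small model is central), RAG (retrieval with BM25 + cosine similarity), and LLM Agents (studies persistent AI agents). Moderately relevant keywords: Hallucination Mitigation (the paper discusses confident hallucinations) and In-context Learning (uses retrieved context). Other keywords such as MoE, Scaling Laws, and RLHF are not covered in the paper.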

!!! tip deepseek-chat TL;DR

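The paper proposes a memory-augmented inference framework in which a lightweight 8B-parameter model answers repeated queries using retrieved conversational context. Without additional training, it recovers 69% of the performance of a 235B model while reducing cost by 96%, showing that for user-specific queries, access to relevant knowledge matters more than model scale.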

Abstract

Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.

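Keywords: Memory Augmented Inference, AI Agents, Small Language Models, Retrieval-Augmented Generation, Conversational Memory, Routing, Efficiency, Persistent AI

In-depth analysis:

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Summary:

To address the repetitive user queries that AI agents face in production, the paper proposes a memory-augmented inference framework. It finds that up to 47% of queries are semantically similar to prior interactions, yet conventional processing ignores this redundancy. The authors combine a lightweight 8B model, retrieved conversational memory, and confidence-based routing to achieve efficient inference. Experiments show that, without additional training, the method recovers 69% of the performance of the full-context 235B model while reducing effective cost by 96%. The core finding is that for user-specific queries, access to relevant knowledge matters more than model scale; memory mainly improves answer correctness rather than changing routing decisions.

Innovations:

Reveals how memory and routing interact: memory makes routing "worthwhile" (by ensuring quality) rather than merely "possible" (by lowering cost), addressing the "confident but wrong" behavior of the small model without memory.
Proposes a cross-model memory injection strategy: the small model can use conversational memory generated by the large model, transferring knowledge without retraining and amortizing past compute cost over future queries.
Runs a 2x2 factorial experiment that quantifies the individual and combined effects of memory and routing on user-specific queries, filling a gap in the study of their interaction.
Integrates hybrid retrieval that combines BM25 and cosine similarity, improving retrieval quality and end-to-end system performance.

Method

!!! info

The paper builds a routing layer that sits between the client and the inference backend. First, it uses cross-model memory injection, storing conversational question-answer pairs (turn-pairs) rather than summaries in a vector database, with embeddings generated by a Matryoshka representation model. Second, at query time it retrieves relevant context with hybrid retrieval (BM25 + dense-vector cosine similarity). It then computes a confidence score from the log-probabilities of the model output and applies a threshold to decide whether to use the small model or escalate to the large model. Finally, it evaluates F1 and cost-effectiveness with a 2x2 factorial experiment (with/without memory x with/without routing) on the LoCoMo and LongMemEval datasets.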
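
The hybrid retrieval step can be sketched in plain Python as a weighted fusion of a sparse BM25 score and a cosine similarity score over stored turn-pairs. This is an illustration under stated assumptions, not the paper's implementation. Bag-of-words vectors stand in for the Matryoshka dense embeddings, and the min-max normalization, the fusion weight `alpha`, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of the query against each document."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def cosine_scores(query: str, docs: List[str]) -> List[float]:
    """Cosine similarity on bag-of-words counts (stand-in for dense embeddings)."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    out = []
    for d in docs:
        v = Counter(tokenize(d))
        v_norm = math.sqrt(sum(c * c for c in v.values()))
        dot = sum(q[t] * v[t] for t in q)
        out.append(dot / (q_norm * v_norm) if q_norm and v_norm else 0.0)
    return out


def min_max(xs: List[float]) -> List[float]:
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query: str, docs: List[str], alpha: float = 0.5) -> List[int]:
    """Document indices sorted by alpha * BM25 + (1 - alpha) * cosine."""
    if not docs:
        return []
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max(cosine_scores(query, docs))
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)


memory = [
    "Q: where does my sister live? A: she lives in Lisbon",
    "Q: what is my dog's name? A: the dog is called Biscuit",
    "Q: what food do I like? A: you like spicy ramen",
]
print(hybrid_rank("what is the name of my dog", memory))
```

Normalizing each score list before fusion keeps one signal from dominating simply because of its scale; other fusion schemes, such as reciprocal rank fusion, would be equally plausible here.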

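Key results:

The 8B model with memory and routing reaches 30.5% F1 on LoCoMo, about 15 F1 points above the unassisted 8B model (15.4% F1), recovering 69% of the performance of the full-context 235B model.
Compared with the full-context 235B model, the method reduces effective cost by 96%.
The 235B model without memory (13.7% F1) performs worse than the 8B model without memory (15.4% F1), indicating that access to knowledge matters more than model scale.
At practical confidence thresholds, 96% of queries are routed to the small model, but accuracy without memory is poor (13.0% F1); memory turns confident errors into confident correct answers.
Hybrid retrieval adds 7.7 F1 over cosine-only retrieval, showing that retrieval quality directly improves system performance.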
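
The routing decision described above can be sketched as a threshold on a confidence score derived from token log-probabilities. How the paper aggregates log-probabilities is not stated in this report, so the geometric-mean token probability below is an assumption, as are the 0.7 threshold and the names `confidence` and `route`.

```python
import math
from typing import List


def confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability: exp of the mean log-probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(token_logprobs: List[float], threshold: float = 0.7) -> str:
    """Keep the small model's answer if it is confident, else escalate."""
    return "small" if confidence(token_logprobs) >= threshold else "large"


print(route([-0.05, -0.10]))  # high per-token probability
print(route([-1.20, -0.90]))  # low per-token probability
```

A gate like this measures only how sure the small model is, not whether it is right. That matches the reported finding that routing alone sends most queries to the small model while accuracy stays low until memory grounds the answers.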

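Tech stack: Qwen3-8B / Qwen3-235B (models), BM25 (sparse retrieval algorithm), Cosine Similarity (dense retrieval), Matryoshka Embeddings (representation model), Log-probabilities (confidence estimation), Vector Database

Strengths

Practical: requires no additional training or labeled data and is easy to deploy in production.
Cost-effective: substantially reduces inference cost while retaining relatively high performance.
Insightful: the factorial experiment clearly exposes the non-intuitive interaction between memory and routing.
Robust: storing raw question-answer pairs instead of summaries avoids the hallucination risk of summarization.

Limitations

Limited reasoning ability: memory helps factual recall but may hurt complex tasks such as temporal reasoning (-3.8 F1).
Cold start: first-time queries still rely on the large model, and system performance improves gradually as memory accumulates.
Retrieval dependence: performance depends heavily on retrieval quality; retrieving irrelevant context can introduce noise.
Evaluation scope: evaluated mainly on specific datasets, which may not fully represent the complexity of all production environments.

Relevance to research direction:

The paper is highly relevant. It belongs to "innovation in the technical principles of large models", in particular inference optimization, model routing, and retrieval-augmented generation (RAG). It addresses a key challenge for production AI agents (persistent agents), which is consistent with "research applications of large models in different domains". Its contribution lies in system architecture and the discovery of an interaction mechanism rather than plain model fine-tuning, which meets the "highly innovative" criterion.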

3. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Authors: Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22446v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

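Scoring rationale: The paper's core subject is the fine-tuning mechanism of RLVR (Reinforcement Learning with Verifiable Rewards) for LLMs, which falls under innovation in the technical principles of large models. Highly relevant keywords: "Large Language Models" (the paper explicitly studies LLMs), "RLHF" (RLVR is a reinforcement-learning fine-tuning method closely related to RLHF/RLAIF/DPO), and "Mechanistic Interpretability" (the paper performs token-level mechanism analysis, which belongs to explainable AI). Moderately relevant keywords: "Mixture of Experts" (the paper mentions "sparse" changes, but not an MoE architecture), "Post-training" (involves fine-tuning), and "Chain of Thought" and "System 2 Thinking" (the paper studies improvements in reasoning performance). The remaining keywords are not covered.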

!!! tip deepseek-chat TL;DR

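The paper studies how RLVR (reinforcement learning with verifiable rewards) fine-tuning improves the reasoning ability of large language models through sparse token-level distributional changes, and demonstrates the functional importance of those changes.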

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR’s performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.

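Keywords: RLVR, token-level analysis, distributional shifts, fine-tuning, large language models, reasoning, sparse changes, reinforcement learning

In-depth analysis:

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Summary:

The paper presents an in-depth empirical study of the mechanism by which reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models. The motivation is that RLVR is effective but its token-level mechanism remains unclear. The study is organized around three analyses: first, it quantifies token-level distributional shifts between the base and RL models; second, it uses cross-sampling interventions to analyze how these shifts affect sequence-level reasoning performance; third, it examines the fine-grained mechanics of the shifts. The conclusion is that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions differing meaningfully. Experiments show that inserting a small number of RL-sampled tokens recovers performance, while injecting a small number of base tokens collapses it. This suggests that RLVR steers reasoning trajectories mainly by reallocating probability mass at critical token positions rather than globally rewriting model behavior.

Innovations:

Shows that RLVR fine-tuning produces highly sparse and targeted token-level distributional shifts, with only a few critical tokens changing meaningfully.
Proposes a cross-sampling intervention that swaps token choices between the base and RL models, providing strong evidence that the sparse token changes are causally responsible for the performance gains.
Analyzes the fine-grained mechanics of RLVR and finds that it mainly reallocates probability mass within the existing candidate set rather than introducing new tokens.
Explores divergence-weighted advantage signals as a diagnostic intervention and obtains improvements over baselines, suggesting a direction for optimizing RLVR.

Method

!!! info

The paper follows a systematic empirical methodology. First, it uses Jensen-Shannon (JS) divergence to quantify the difference between the base and RL models' token distributions under the same context. Second, it designs forward and reverse cross-sampling interventions that selectively swap the base and RL models' token choices during generation to assess the effect on sequence-level reasoning performance. It also analyzes token entropy, positional concentration, reallocation of probability mass, and evolution over training to expose the fine-grained mechanics of the shifts. The findings are validated on several models, including Qwen and Mistral, and on datasets such as AIME and GPQA.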
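
To make the divergence measurement concrete, here is a minimal sketch of the Jensen-Shannon divergence between two next-token distributions, plus a helper that reports the fraction of positions where the divergence is near zero. The helper `near_zero_fraction` and its `eps` cutoff are illustrative assumptions; the report does not say how the paper defines "near zero".

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def near_zero_fraction(
    base_dists: List[Sequence[float]],
    rl_dists: List[Sequence[float]],
    eps: float = 1e-3,
) -> float:
    """Share of token positions whose base/RL divergence is below eps."""
    if not base_dists:
        return 0.0
    hits = sum(js_divergence(p, q) < eps for p, q in zip(base_dists, rl_dists))
    return hits / len(base_dists)


# Three positions over a 3-token vocabulary; only the middle one shifts.
base = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
rl = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(near_zero_fraction(base, rl))
```

Because the mixture `m` is positive wherever `p` or `q` is positive, the JS divergence stays finite even when the two distributions have disjoint support, which is one reason it suits this kind of comparison better than raw KL.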

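Key results:

RLVR-induced distributional shifts are highly sparse: the divergence is near zero at the vast majority of token positions (DAPO >83%, SimpleRL >98%).
The shifts are structured by sequence position, concentrating at the beginning of the sequence (affecting high-level branching decisions) and at the end (affecting formatting).
Cross-sampling experiments show that inserting a small number of RL-sampled tokens into base-model generations recovers RL performance; conversely, injecting a small number of base tokens collapses RL-model performance.
At high-divergence positions, RLVR acts mainly by reallocating probability mass among existing candidate tokens rather than introducing entirely new tokens.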
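
A toy version of the cross-sampling intervention: a "host" model generates, and at positions where its next-token distribution diverges from a "donor" model's by more than a threshold, the donor's choice is used instead, up to a fixed budget. This is a simplified sketch and not the paper's protocol. It decodes greedily rather than sampling, the models are hand-written lookup functions, and `cross_sample`, `budget`, and `threshold` are hypothetical names and values.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]


def js_divergence(p: List[float], q: List[float]) -> float:
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda x, y: sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cross_sample(
    prompt: List[str],
    host_model: Callable[[List[str]], Dist],
    donor_model: Callable[[List[str]], Dist],
    budget: int,
    threshold: float = 0.1,
    max_len: int = 8,
) -> Tuple[List[str], int]:
    """Generate with the host, swapping in up to `budget` donor choices."""
    tokens, swaps = list(prompt), 0
    for _ in range(max_len):
        host, donor = host_model(tokens), donor_model(tokens)
        vocab = sorted(set(host) | set(donor))
        p = [host.get(t, 0.0) for t in vocab]
        q = [donor.get(t, 0.0) for t in vocab]
        if swaps < budget and js_divergence(p, q) > threshold:
            dist, swaps = donor, swaps + 1  # intervene at a divergent position
        else:
            dist = host
        tok = max(dist, key=dist.get)  # greedy choice for determinism
        tokens.append(tok)
        if tok == "<eos>":
            break
    return tokens[len(prompt):], swaps


def make_toy(first_step: Dist) -> Callable[[List[str]], Dist]:
    # The toy models differ only in their first decision after <bos>.
    def model(tokens: List[str]) -> Dist:
        last = tokens[-1]
        if last == "<bos>":
            return first_step
        if last == "guess":
            return {"wrong": 1.0}
        if last == "verify":
            return {"right": 1.0}
        return {"<eos>": 1.0}
    return model


base_model = make_toy({"guess": 0.6, "verify": 0.4})
rl_model = make_toy({"guess": 0.1, "verify": 0.9})

print(cross_sample(["<bos>"], base_model, rl_model, budget=0))
print(cross_sample(["<bos>"], base_model, rl_model, budget=1))
```

In this toy setup a single swapped token at the one divergent position redirects the rest of the trajectory, which mirrors the qualitative claim that a few token-level decisions carry most of the effect.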

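Tech stack: Algorithms: Reinforcement Learning with Verifiable Rewards (RLVR), Group Relative Policy Optimization (GRPO), DAPO, SimpleRL; Mathematical methods: Jensen-Shannon (JS) Divergence, KL Divergence, Token Entropy, Cross-Sampling Interventions; Models: Qwen2.5-32B, Qwen2.5-Math-7B, Qwen3-8B-Base, Mistral-Small-24B; Datasets: AIME 2024, AIME 2025, GPQA

Strengths

Novel perspective: offers a fine-grained, token-level view of the RLVR fine-tuning mechanism, filling a gap left by studies that look only at sequence-level metrics.
Well-designed experiments: the cross-sampling intervention gives strong evidence for the causal importance of sparse token changes, with rigorous logic.
Generalizable conclusions: the findings are consistent across multiple model families (Qwen, Mistral) and datasets.
Beyond explaining the phenomenon, it proposes potential improvements such as divergence-weighted advantage signals, which have practical value.

Limitations

The study focuses mainly on mathematical reasoning tasks (such as AIME); generality to other task types (such as creative writing, code generation, or commonsense reasoning) remains to be verified.
Although it shows the importance of critical tokens, it does not yet provide a complete solution for automatically identifying these positions to guide more efficient training.
The text excerpt available for this analysis appears incomplete in its coverage of the entropy relationship, which may limit a full understanding of that mechanism.

Relevance to research direction:

The paper is highly relevant. It directly studies the principles of RLVR fine-tuning, a core technique for large language models (LLMs), and so falls under innovation in the technical principles of large models. It involves reinforcement learning algorithms in deep learning (GRPO, DAPO) and analyzes the token distribution mechanics inside the model. Although the main application is mathematical reasoning (a branch of the sciences), the sparsity mechanism it reveals is significant for understanding large-model applications in scientific computing and other complex domains. The paper is highly innovative, offering a new analysis framework and intervention method, and meets the criteria for a high score.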

4. Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-B

Authors: Shaoxin Zhong, Yuchen Su, Michael Witbrock Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22904v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

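Scoring rationale: The paper proposes a three-layer framework that separates diagnosis from control and uses LLMs for diagnosis in agent-based simulations. Relevant keywords: 1) "Large Language Models" (10 points): the LLM as a diagnostic instrument is the paper's core technique; 2) "LLM Agents" (10 points): the paper applies LLMs in agent-based simulation, which falls under LLM agent research; 3) "Multi-agent Systems" (5 points): involves agent-based simulation but does not explore agent coordination in depth; 4) "Explainable AI" (5 points): emphasizes auditability and traceability, which relate to explainable AI; 5) "AI for Science" (5 points): applied to elderly care simulation, an application of AI in a scientific domain. Other keywords are unrelated to the paper and score 0.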

!!! tip deepseek-chat TL;DR

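The study addresses the difficulty of making policy interventions in agent-based simulations both adaptive and auditable. It uses a three-layer framework that separates diagnosis from control, with LLMs performing diagnosis and deterministic rules performing control. Experiments in an elderly care simulation show that the approach outperforms an end-to-end black-box LLM approach by 11.7% while preserving full auditability.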

Abstract

Mitigating elderly loneliness requires policy interventions that achieve both adaptability and auditability. Existing methods struggle to reconcile these objectives: traditional agent-based models suffer from static rigidity, while direct large language model (LLM) controllers lack essential traceability. This work proposes a three-layer framework that separates diagnosis from control to achieve both properties simultaneously. LLMs operate strictly as diagnostic instruments that assess population state and generate structured risk evaluations, while deterministic formulas with explicit bounds translate these assessments into traceable parameter updates. This separation ensures that every policy decision can be attributed to inspectable rules while maintaining adaptive response to emergent needs. We validate the framework through systematic ablation across five experimental conditions in elderly care simulation. Results demonstrate that explicit control rules outperform end-to-end black-box LLM approaches by 11.7% while preserving full auditability, confirming that transparency need not compromise adaptive performance.

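Keywords: LLM-based diagnostics, agent-based simulations, policy adaptation, auditability, elderly care, three-layer framework, traceable parameter updates, adaptive response

In-depth analysis:

Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics

Summary:

To mitigate elderly loneliness, the paper proposes a three-layer framework that separates diagnosis from control, aiming to make policies both adaptive and auditable. Traditional agent-based models (ABMs) lack flexibility, while using large language models (LLMs) directly for control lacks traceability. The framework uses LLMs as diagnostic instruments that assess population state and generate structured risk evaluations, while the control layer uses deterministic formulas with explicit bounds to translate those evaluations into parameter updates. In a nursing-home simulation with 30 agents, the method outperforms the end-to-end black-box LLM approach by 11.7% while preserving full auditability, showing that transparency need not come at the cost of adaptivity.

Innovations:

Proposes a "diagnosis separated from control" architecture that strictly limits the LLM to a diagnostic role rather than a direct decision-maker, addressing the auditability problem of black-box LLM decisions.
Designs a three-layer framework (simulation layer, diagnosis layer, control layer) that combines theory-driven ABM simulation, data-driven LLM diagnosis, and deterministic parameter update rules.
Introduces explicit bounds and deterministic formulas as the control layer, so that every policy decision can be traced to inspectable rules while remaining responsive to emergent needs.

Method

!!! info

The paper builds a nursing-home environment with 30 agents simulated over 200 days. The simulation layer models agent dynamics based on social-science theory (such as homophily). Every 7 days, the diagnosis layer uses a locally deployed LLM (Ollama llama3:8b) to assess high-risk agents and output structured risk data. The control layer uses aggregated population statistics with preset deterministic formulas and thresholds (such as r > 0.4) to update policy parameters (such as social activity intensity and home-visit probability). Ablation experiments compare a fixed policy, black-box LLM control, and the proposed method.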
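
The control layer can be pictured as a small deterministic function: aggregate the risk scores produced by the diagnosis layer, compare the mean against a threshold, and apply a bounded incremental update that is logged for audit. Only the r > 0.4 threshold comes from the text above. The step size, the [0, 1] bounds, the "hold when below threshold" behaviour, and the names `Policy` and `update_policy` are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Policy:
    activity_intensity: float = 0.5  # hypothetical starting values
    visit_probability: float = 0.2


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def update_policy(
    policy: Policy,
    risk_scores: List[float],
    threshold: float = 0.4,
    step: float = 0.05,
) -> Tuple[Policy, Dict[str, object]]:
    """Deterministic, bounded update driven by the mean diagnosed risk."""
    r = mean(risk_scores) if risk_scores else 0.0
    triggered = r > threshold
    delta = step if triggered else 0.0
    new_policy = Policy(
        activity_intensity=clamp(policy.activity_intensity + delta, 0.0, 1.0),
        visit_probability=clamp(policy.visit_probability + delta, 0.0, 1.0),
    )
    # The audit record ties the decision to an inspectable rule and its inputs.
    audit = {
        "mean_risk": r,
        "threshold": threshold,
        "rule": "increase" if triggered else "hold",
        "delta": delta,
    }
    return new_policy, audit


# Risk scores as they might arrive from the diagnosis layer (made-up values).
policy, audit = update_policy(Policy(), [0.6, 0.5, 0.3])
print(policy, audit["rule"])
```

Because the LLM's output enters only as `risk_scores`, every parameter change can be reproduced from the audit record alone, which is the property the framework is built around.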

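Key results:

The proposed explicit control rules outperform the end-to-end black-box LLM approach by 11.7%.
Compared with the baseline, performance on held-out seeds improves by 15.3%.
Shows that separating diagnosis from control yields full policy auditability without sacrificing adaptive performance.
The deterministic formulas are more stable than the black-box LLM.

Tech stack: LLM: Ollama llama3:8b (local inference, temperature 0.1); Simulation environment: Python 3.8+, NumPy, NetworkX; Algorithms: Agent-Based Modeling (ABM), homophily-based network evolution, deterministic control formulas; Data format: JSON (for LLM input and output)

Strengths

Strong auditability: replacing black-box decisions with deterministic formulas meets the strict transparency and accountability requirements of the policy domain.
Novel architecture: positioning the LLM as a diagnostician rather than a controller avoids the interpretability risk of direct LLM decisions.
Better performance: experiments show that the method outperforms direct LLM control while remaining transparent.
Good stability: explicit parameter bounds and incremental updates prevent abrupt policy swings.

Limitations

Small simulation scale: the experiments use only 30 agents and may not fully reflect the dynamics of large, complex social systems.
Subjective thresholds: key thresholds in the control layer (such as r > 0.4) are modeling parameters chosen from pilot experiments rather than clinically validated constants, so their generality remains to be verified.
Computational cost: despite screening, frequent LLM calls for diagnosis may still add overhead, which limits use in very large real-time systems.

Relevance to research direction:

The paper is highly relevant. It belongs to the "applications of large models in scientific domains" subfield, applied specifically to social-science simulation (elderly care). It also involves "innovation in the technical principles of large models": it proposes a new "diagnosis separated from control" architectural paradigm that addresses the interpretability and controllability of LLMs in decision-making applications. Its design approach, treating the LLM as a component rather than a fully empowered agent, is a useful reference for AI for Science and trustworthy AI research.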

5. MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage

Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23501v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

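Scoring rationale: The paper studies the input-validation ability of medical vision language models (VLMs) in clinical triage tasks and has no direct connection to most large-model technique keywords (such as MoE, quantization, or inference acceleration). Relevance scores: 1) "Large Language Models" scores 5 because VLMs belong to the multimodal large-model family; 2) "Hallucination Mitigation" scores 10 because the paper's core subject is models hallucinating on invalid inputs (producing plausible but wrong narratives); 3) "Explainable AI" scores 5 because the paper involves model reliability evaluation and failure-mode analysis; 4) "AI for Science" scores 10 because the paper is applied medical AI research. All other keywords are not covered and score 0.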

!!! tip deepseek-chat TL;DR

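The paper shows that medical vision language models have insufficient input-validation ability in clinical triage tasks: even when given inconsistent or invalid medical image inputs, a model may still produce a plausible but wrong diagnostic narrative, which is a critical safety risk.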

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

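Keywords: Vision Language Models, Medical AI, Clinical Triage, Input Validation, Hallucination, Benchmark, Safety, Medical Imaging

In-depth analysis:

MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage

Summary:

The paper addresses what it calls the “medical Moravec’s paradox” in vision language models (VLMs): a model can produce fluent diagnostic text yet fail basic visual sanity checks. To measure this, the authors propose the MedObvious benchmark. Its 1,880 tasks present multi-panel image grids (for example 2x2 or 3x3) and ask the model either to identify the outlier panel or to confirm that all panels are consistent, so that input validation is evaluated in isolation. The benchmark spans five progressive tiers, from basic orientation errors to clinical triage, and five evaluation formats. Across 17 VLMs, sanity checking remains unreliable: models often hallucinate anomalies on normal inputs, and performance depends strongly on image-set size and evaluation format. The authors conclude that pre-diagnostic verification is a key safety problem that must be solved before medical VLMs are deployed.

Innovations:

Introduces the concept of the “medical Moravec’s paradox”, exposing the gap between medical VLMs' high-level diagnostic language generation and their low-level visual sanity checking.
Builds the MedObvious benchmark, described as the first large-scale medical multimodal benchmark dedicated to pre-diagnostic visual sanity checks and input validation.
Designs an evaluation scheme with five tiers of increasing difficulty (T1-T5), from basic quality control to clinical triage, to test visual consistency systematically.
Introduces five evaluation protocols (including multiple choice, open-ended question answering, and visual referring), revealing how robust and how format-sensitive models are across interfaces.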

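Method

!!! info

The paper builds a grid-based benchmark, using datasets such as ROCO and Kvasir to assemble 2x2 or 3x3 grids of medical images. Each task asks the model to identify the panel that is inconsistent in modality, anatomy, viewpoint, or orientation, or to judge that all panels are consistent (the negative control). Seventeen general-purpose, medical-specific, and proprietary VLMs are evaluated zero-shot, and results are broken down by task tier and evaluation format.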
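The split between outlier grids and negative controls can be illustrated with a small scoring sketch. This is only an illustration under stated assumptions: the `GridTask` record, the yes/no framing of the model verdict, and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
from dataclasses import dataclass

@dataclass
class GridTask:
    # Hypothetical record for one multi-panel grid.
    has_outlier: bool        # False marks a negative-control grid (all panels consistent)
    predicted_outlier: bool  # the model's yes/no verdict for the grid

def split_accuracy(tasks):
    """Return (positive accuracy, negative accuracy, false-positive rate)."""
    pos = [t for t in tasks if t.has_outlier]
    neg = [t for t in tasks if not t.has_outlier]
    pos_acc = sum(t.predicted_outlier for t in pos) / len(pos)
    neg_acc = sum(not t.predicted_outlier for t in neg) / len(neg)
    # A false positive is an anomaly reported on a fully consistent grid.
    return pos_acc, neg_acc, 1.0 - neg_acc

# Toy data for illustration only, not results from the paper.
tasks = [
    GridTask(True, True), GridTask(True, False),
    GridTask(False, False), GridTask(False, False),
    GridTask(False, False), GridTask(False, True),
]
pos_acc, neg_acc, fpr = split_accuracy(tasks)
print(pos_acc, neg_acc, fpr)
```

Reporting the two accuracies separately keeps a model that always answers "anomaly present" from looking good: it would score perfectly on outlier grids and zero on negative controls.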

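Key results:

The best-performing model (Qwen2.5-VL-7B) reaches an average accuracy of only 63.2%, well below the 88.4% achieved by human experts.
Models perform poorly on negative controls (grids in which all images are consistent) and are prone to false positives, reporting an anomaly on normal input.
Performance generally drops as the image set grows from 2x2 to 3x3.
Results differ substantially between multiple-choice (MCQ) and open-ended generation tasks, showing sensitivity to the evaluation format.

Tech stack: Vision-Language Models (VLMs): LLaVA, Qwen-VL, InternVL, GPT-4o, Gemini, and others; Datasets: ROCO (Radiology Objects in COntext), Kvasir; Evaluation Metrics: Accuracy (Positive/Negative), Average Accuracy; Task Formats: Detection MCQ, Detection Open, Referring MCQ, Referring Open, Visual Referring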

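Strengths

A distinctive angle: it targets input validation and sanity checking, a safety step in medical AI that is often overlooked but critical.
Rigorous benchmark design: negative controls and multi-dimensional evaluation allow hallucination and robustness to be measured directly.
Comprehensive experiments: 17 mainstream models are covered, with detailed comparative analysis.
Practical clinical relevance: the tasks map directly onto real needs in multi-view ultrasound, multi-slice CT/MRI, and AI agent workflows.

Limitations

The grid layout is an abstraction of a clinical viewer and may not fully reproduce the complex interactive environment of real clinical software such as 3D Slicer.
The data come mainly from existing public datasets such as ROCO, which may not cover every rare modality or specific clinical artifact.
The evaluation focuses on zero-shot ability and does not fully explore how much in-context learning or fine-tuning could improve pre-diagnostic checking.
The benchmark targets visual consistency checking and does not involve complex pathological diagnostic reasoning, so it cannot serve as a comprehensive benchmark of medical VLM performance.

Relevance to research direction:

The paper is highly relevant. It concerns the application of large models and deep learning to biomedicine, specifically the safety evaluation of medical vision language models (VLMs). It proposes a new evaluation benchmark (a technical contribution) and analyses in depth the models' deficiencies in basic visual perception (a contribution to understanding the underlying principles), which matches the user's interest in scientific applications and technical innovation.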

6. Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts

Authors: Maida Aizaz, Quang Minh Nguyen Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22837v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

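Scoring rationale: The paper studies how LLMs behave when generating personas for geopolitical identities and how they interpret fairness, which matches the LLMs keyword directly (10 points). It analyses the relationship between model reasoning and generated output, which touches on Chain of Thought (5 points). It examines how models interpret fairness and adjust their outputs, which relates to Alignment and Hallucination Mitigation (5 points each). It studies model behaviour through reasoning traces, which relates to Mechanistic Interpretability (5 points). The other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, agent systems, and model compression, are not covered by the paper (0 points).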
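The score arithmetic behind this entry can be checked with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight) is the one stated in the report's configuration section; treating each row's score as weight × relevance is an assumption that matches the table only because every weight is 1.0, and the function names are hypothetical.

```python
def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale."""
    # Assumption: a row's score is weight * relevance.
    return sum(weight * relevance for weight, relevance in rows)

def pass_mark(total_weight, base=5.0, per_weight=0.8):
    # Formula from the report's configuration section.
    return base + per_weight * total_weight

# Five non-zero rows for this paper; the remaining 22 of the 27 keywords score 0.
rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0)] + [(1.0, 0.0)] * 22
score = total_score(rows)
threshold = round(pass_mark(sum(w for w, _ in rows)), 1)
passed = score >= threshold
print(score, threshold, passed)
```

With 27 keywords of weight 1.0, the threshold works out to 26.6, so this paper's 30.0 clears it by a small margin.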

!!! tip deepseek-chat TL;DR

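The study analyses how five large language models generate Palestinian and Israeli personas against the backdrop of the Israeli-Palestinian conflict. It finds that the distribution of socioeconomic attributes differs between war and non-war contexts and that fairness instructions change the output distributions, but that the fairness concepts mentioned in the models' reasoning do not translate directly into consistently fair outputs.

Abstract translation (Chinese)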

大型语言模型（LLMs）正日益广泛地应用于社会模拟和角色生成，这要求我们理解它们如何表征地缘政治身份。本文通过640种实验条件（涵盖战争与非战争背景，并分配不同角色），分析了五种主流LLM为巴勒斯坦和以色列身份生成的角色特征。我们观察到生成属性中存在显著的分布模式：在战争背景下，巴勒斯坦角色常与较低的社会经济地位和生存导向型角色相关联，而以色列角色则大多保持中产阶级地位和专业职业属性。当明确提示模型避免有害假设时，模型展现出多样化的分布变化，例如非二元性别推断显著增加，或职业角色趋于泛化（如“学生”），但潜在的社会经济差异往往依然存在。此外，对推理轨迹的分析揭示了模型推理与生成之间的有趣动态：尽管推理过程始终提及与公平相关的概念，但最终生成的角色仍遵循上述多样化的分布变化。这些发现描绘了模型如何解读地缘政治背景，同时表明它们以不同方式处理公平性并进行调整；公平概念并未被一致且直接地转化为具有代表性的生成结果。

Abstract

Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., “student”), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamics between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.

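Keywords: Large Language Models, Persona Generation, Geopolitical Contexts, Fairness Interpretation, Reasoning Traces, Social Simulation, Distributional Patterns, Model Behavior Analysis

In-depth analysis:

Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts

Summary:

The paper studies how large language models (LLMs) generate Palestinian and Israeli personas in a polarised geopolitical context (the Israeli-Palestinian conflict) and how they interpret fairness. Five mainstream LLMs generate personas under 640 experimental conditions (war vs non-war context, different assigned roles), and the attribute distributions are analysed. In war contexts, models tend to associate Palestinians with lower socioeconomic status and survival-oriented roles, while Israelis mostly keep middle-class status and professional attributes. Adding a safety instruction shifts gender and occupation, but the socioeconomic gap often persists. An analysis of the reasoning text with sparse autoencoders (SAEs) finds that fairness concepts appear in the models' rationales while the generated personas remain biased, indicating an inconsistency in how the models handle fairness.

Innovations:

Focus on a polarised geopolitical context: described as the first in-depth examination of how LLMs generate identity-specific personas against the highly sensitive and polarised backdrop of the Israeli-Palestinian conflict, addressing a gap left by bias research that mostly concentrates on Western contexts.
Exposing the disconnect between reasoning and generation: SAE analysis of reasoning traces shows that models mention fairness concepts while reasoning, yet the generated personas still carry biases such as socioeconomic status, revealing a gap between what the model says and what it does.
Multi-dimensional experimental design: the experiments combine war/non-war contexts, different roles (such as peacekeeper and journalist), and a safety-instruction intervention to assess systematically how models behave under different pressures and instructions.
Analysis of heterogeneous intervention effects: the paper examines in detail how the "avoid harmful assumptions" instruction affects each model and finds that models interpret fairness in different directions, which produces diverse distributional changes.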

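Method

!!! info

The study builds a controlled experiment that combines 5 roles (peacekeeper, journalist, and others), 2 contexts (war / non-war), and 2 target identities (Palestinian / Israeli), for a total of 640 prompt conditions. Five mainstream LLMs generate profiles containing gender, age, socioeconomic status (SES), city, occupation, and appearance, and the generated content is normalised and categorised. An intervention experiment adds an explicit safety instruction (avoid harmful assumptions) to the prompt. Finally, sparse autoencoders (SAEs) extract interpretable features from the explanation text the models generate, and feature frequencies are compared before and after the safety instruction is added.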
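One way to quantify a distributional change between baseline and safety-instruction outputs is sketched below. Total variation distance is a swapped-in measure chosen for illustration; the source does not say which statistic the authors use, and the SES labels here are made-up toy data, not results from the paper.

```python
from collections import Counter

def distribution(labels):
    """Turn a list of categorical labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical SES labels for one identity/context cell.
baseline = ["low", "low", "low", "middle"]
with_instruction = ["low", "low", "middle", "middle"]
shift = total_variation(distribution(baseline), distribution(with_instruction))
print(shift)
```

A value of 0 means the instruction left the attribute distribution unchanged, and 1 means the two distributions do not overlap at all.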

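Key results:

Socioeconomic bias: in war contexts, Palestinian personas are often given lower socioeconomic status and survival-oriented roles, while Israeli personas keep middle-class status and professional attributes.
Gender distribution differences: models show different gender biases. Gemma leans female, GPT leans male, and non-binary gender appears only in some models (such as Qwen).
Limits of safety prompts: an explicit safety instruction can increase non-binary gender inferences or make occupations more generic (for example "student"), but it often fails to remove the underlying socioeconomic gap.
Separation of reasoning and generation: fairness-related vocabulary and features appear frequently in the models' reasoning text, but this does not translate directly into unbiased outputs, pointing to a break in how models handle fairness internally.

Tech stack: Gemma 3 27B, Qwen3 32B, Llama 3.3 70B Instruct, Gemini 2.5 Pro, GPT-4.1, Sparse Autoencoder (SAE), InterpEmbed toolkit, Statistical Distribution Analysis, Max-pooling features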

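Strengths

Highly topical subject: the paper takes a contested and sensitive geopolitical conflict as its object and confronts LLM bias in a real high-stakes setting.
Combines generation with explanation: it evaluates the final outputs and also uses tools such as SAEs to analyse the models' reasoning, giving deeper behavioural insight.
Broad experimental coverage: several mainstream flagship models are tested, spanning different architectures and training data sources, so the conclusions have reasonable general reference value.

Limitations

Reliance on model self-explanation: the reasoning analysis depends on explanation text generated by the model, which may not match the model's internal chain of thought and carries a risk of post-hoc rationalisation.
Subjectivity of attribute categories: categorising appearance descriptions and normalising occupations may involve some subjectivity, even though the authors standardise the process.
No concrete mitigation: the paper mainly audits and identifies problems. It does not propose or validate a specific alignment algorithm or mechanism that would resolve these geopolitical biases.

Relevance to research direction:

The paper is highly relevant. First, on technical principles, it uses sparse autoencoders (SAEs), a current interpretability tool, to analyse LLM reasoning traces, which is an in-depth study of the internal mechanisms and behaviour of large models. Second, on applications, it applies LLMs to social simulation and persona generation, in particular to sensitive scenarios in political science and sociology, showing both the potential and the risks of LLMs in social science research. Finally, it examines how safety prompts affect model behaviour, which bears directly on safety alignment and exposes the limits of current alignment techniques when dealing with complex bias.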

7. Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Authors: Anupam Pani, Yanchao Yang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23202v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

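Scoring rationale: The paper proposes a gaze-regularized training framework for vision-language-action (VLA) models in robotic manipulation, an application-level contribution of large models to robotics. It is related to “Large Language Models” (5 points) because VLA models can be viewed as multimodal foundation models. It is related to “Pre-training”, “Post-training”, and “Instruction Tuning” (5 points each) because the paper involves model training and adjustment. It is highly related to “Mechanistic Interpretability” (8 points) because regularizing attention patterns makes the model more interpretable. Other keywords such as MoE, quantization, and inference acceleration are not covered by the paper and score 0.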

!!! tip deepseek-chat TL;DR

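The study addresses the lack of a mechanism for active visual attention allocation in vision-language-action models for robotic manipulation. It introduces a gaze-regularized training framework that aligns the model's internal attention with human visual patterns, yielding 4-12% improvements on multiple manipulation benchmarks while also improving interpretability and training efficiency.

Abstract translation (Chinese)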

尽管视觉-语言-动作（VLA）模型已取得进展，但机器人操作在细粒度任务上仍面临挑战，因为现有模型缺乏主动视觉注意力分配的机制。人类注视天然编码了意图、规划与执行模式——为引导机器人感知提供了强大的监督信号。我们提出一种注视正则化训练框架，该框架在不修改架构或增加推理开销的前提下，将VLA模型的内部注意力与人类视觉模式对齐。我们的方法将时间聚合的注视热图转化为图像块级别的分布，并通过KL散度对Transformer的注意力进行正则化，从而构建出对任务相关特征的归纳偏置，同时保持部署效率。当集成到现有VLA架构中时，我们的方法在多个操作基准测试中实现了4-12%的性能提升。注视正则化模型能以更少的训练步骤达到同等性能，并在光照变化和传感器噪声下保持鲁棒性。除性能指标外，学习到的注意力模式可生成可解释的可视化结果，这些结果反映了人类策略，增强了机器人系统的可信度。此外，我们的框架无需眼动追踪设备，可直接应用于现有数据集。这些结果表明，人类感知先验能显著加速机器人学习，同时提升任务性能和系统可解释性。

Abstract

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models’ internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer’s attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.

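Keywords: Vision-Language-Action Models, Robotic Manipulation, Gaze Regularization, Attention Alignment, Transformer Attention, Interpretable Visualizations, Human Perceptual Priors, Task Performance Improvement

In-depth analysis:

Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Summary:

Vision-language-action (VLA) models lack a mechanism for active visual attention allocation in fine-grained robotic manipulation. The paper proposes a gaze-regularized training framework to address this. It uses human gaze patterns as a supervisory signal: a pretrained model generates synthetic gaze heatmaps, which are converted into patch-level distributions. Minimising a KL divergence aligns the VLA model's internal attention with human gaze patterns, introducing an inductive bias during training without modifying the architecture or adding inference overhead. Experiments report 4-12% improvements on benchmarks such as LIBERO, markedly faster convergence, and better robustness and interpretability.

Innovations:

Proposes a gaze-regularized training strategy that uses human visual attention patterns as a supervisory signal, turning the VLA model from a passive observer into an active perceiver.
Intervenes only at training time, with no architectural change and no inference overhead, which preserves deployment efficiency and lets the method be plugged into existing VLA architectures.
Introduces synthetic gaze generation: a pretrained GLC network produces gaze heatmaps for robot datasets that lack eye-tracking data, which addresses data scarcity.
Aligns the transformer's vision-language attention with the human gaze distribution through KL divergence, giving the model a human-like "scan, plan, act" inductive bias.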

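Method

!!! info

1. **Gaze prior generation**: a pretrained Global-Local Correlation (GLC) network generates synthetic gaze heatmaps from robot manipulation videos, capturing both momentary fixations and anticipated gaze shifts.
2. **Distribution conversion**: the temporally aggregated continuous heatmaps are converted into discrete patch-level probability distributions that match the VLA model's visual token structure.
3. **Attention regularization**: during training, the KL divergence between the model's internal vision-language attention distribution and the gaze distribution is computed and added to the loss as a regularization term.
4. **Model training**: the gaze regularizer is added to the standard VLA objective to steer the model toward task-relevant regions, and it is removed at inference time.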
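Steps 2 and 3 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the patch size, the direction of the divergence (gaze relative to attention), the smoothing constant, the loss weight `lam`, and the placeholder `task_loss` are all hypothetical choices.

```python
import numpy as np

def heatmap_to_patch_distribution(heatmap, patch):
    """Sum heatmap mass inside each patch and normalise to a distribution."""
    h, w = heatmap.shape
    gh, gw = h // patch, w // patch
    blocks = heatmap[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    mass = blocks.sum(axis=(1, 3)).reshape(-1) + 1e-8  # avoid an all-zero vector
    return mass / mass.sum()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions, with light smoothing."""
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

# Toy example: gaze concentrated on the top-left quarter of a 4x4 heatmap.
heatmap = np.zeros((4, 4))
heatmap[0:2, 0:2] = 1.0
gaze = heatmap_to_patch_distribution(heatmap, patch=2)  # 4 patches
attention = np.full(4, 0.25)                            # uniform model attention

reg = kl_divergence(gaze, attention)
task_loss, lam = 1.0, 0.1            # placeholder values for illustration
total_loss = task_loss + lam * reg   # the regularizer exists only during training
print(gaze, reg, total_loss)
```

Because the extra term only appears in the training loss, the deployed model runs the same forward pass as an unregularized one, which is consistent with the claim of no inference overhead.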

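Key results:

On the LIBERO-Spatial benchmark, the model reaches a 95.5% success rate, nearly 10 percentage points above the 85.9% baseline.
The LIBERO-Object and LIBERO-Goal suites also show clear gains (4-12%).
The model converges faster, with a 6-8% gain already visible at only 20,000 training steps.
It is more robust under visual perturbations such as lighting changes and sensor noise.
The learned attention patterns are interpretable and reflect human manipulation strategies, which increases trust in the system.

Tech stack: Vision-Language-Action (VLA) Models, Transformer Architecture (Causal Attention), Global-Local Correlation (GLC) Network (Gaze Prediction), Kullback-Leibler (KL) Divergence, LIBERO Benchmark (Evaluation Datasets)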

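Strengths

Efficiency: the extra computation is incurred only during training and inference cost is unchanged, which suits real-time deployment.
Generality: no change to the underlying architecture is needed, so the method can be applied modularly to existing VLA systems.
Clear performance gains: consistent improvements in task performance and sample efficiency across several benchmarks.
Better interpretability: the resulting attention maps match human intuition, which helps build trust in robotic systems.
Data use: synthetic gaze makes effective use of existing robot datasets without costly eye-tracking data collection.

Limitations

Dependence on the gaze prediction model: the quality of the synthetic gaze depends heavily on the accuracy of the pretrained gaze predictor (GLC), and prediction errors may introduce noise.
Human-robot mismatch: human gaze patterns may differ from the robot's optimal manipulation strategy (for example in field of view or kinematic constraints), so forced alignment may be suboptimal in some extreme scenarios.
Modality limits: the method mainly optimises visual attention, so its benefit may be limited for complex manipulation tasks that rely heavily on tactile or force feedback.

Relevance to research direction:

The paper is a contribution to the technical principles of large models and deep learning. It targets vision-language-action (VLA) models, an emerging large-model paradigm, and proposes a new way to improve their internal attention. Although the application area is robotic manipulation, the core contribution is the use of a human cognitive prior (gaze) to improve the training of a deep learning model (the transformer), which is innovation at the algorithm level. This is highly relevant to the user's interest in innovation in large-model technical principles, and the work is fairly novel.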

📋 All papers

1. ✅ SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23483v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

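The paper addresses the sequential overhead and latency caused by the perception, reasoning, and tool-calling loop in agentic multimodal large language models (MLLMs). It proposes the SpecEyes framework, which combines speculative planning by a lightweight small model with a cognitive gating mechanism to achieve a 1.1-3.35x speedup while preserving or improving accuracy, and to raise system throughput.

Abstract translation (Chinese)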

具备代理能力的多模态大语言模型（Agentic multimodal large language models, MLLMs）（例如OpenAI o3和Gemini Agentic Vision）通过迭代式视觉工具调用实现了卓越的推理能力。然而，这种级联的感知、推理与工具调用循环引入了显著的顺序开销。这种被称为代理深度（agentic depth）的开销会导致难以接受的延迟，并严重限制系统级并发性。为此，我们提出了SpecEyes——一个代理级推测加速框架，旨在打破这一顺序瓶颈。我们的核心洞见是，一个轻量级、无需工具调用的MLLM可以作为推测规划器来预测执行轨迹，从而在不牺牲准确性的前提下提前终止昂贵的工具链。为了规范这种推测规划，我们引入了一种基于答案可分离性（answer separability）的认知门控机制，该机制可在无需真实标签的情况下量化模型的自验证置信度。此外，我们设计了一种异构并行漏斗结构，利用小模型的无状态并发性来掩盖大模型有状态的串行执行，从而最大化系统吞吐量。在V* Bench、HR-Bench和POPE上的大量实验表明，SpecEyes相比代理基线实现了1.1-3.35倍的加速，同时保持甚至提升了准确率（最高提升+6.7%），从而显著提升了并发工作负载下的服务吞吐量。

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model’s confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

Keywords: Agentic Multimodal LLMs, Speculative Acceleration, Tool-free MLLM, Cognitive Gating, Heterogeneous Parallel Funnel, System Throughput, Visual Tool Invocation, Agentic Depth
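The control flow of confidence-gated speculation might look like the sketch below. It is a loose illustration only: the abstract does not define answer separability, so the top-1 minus top-2 probability margin used here is a stand-in, and the `threshold` value and the stub models are hypothetical.

```python
def separability(answer_probs):
    """Stand-in confidence score: margin between the two most likely answers."""
    ranked = sorted(answer_probs.values(), reverse=True)
    return ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)

def answer(question, small_model, agentic_model, threshold=0.5):
    """Accept the small model's answer if it is confident, else fall back."""
    probs = small_model(question)
    if separability(probs) >= threshold:
        return max(probs, key=probs.get), "speculative"
    # Low confidence: run the expensive tool-using pipeline instead.
    return agentic_model(question), "agentic"

# Stub models for illustration; real models would return answer probabilities.
confident = lambda q: {"cat": 0.9, "dog": 0.1}
unsure = lambda q: {"cat": 0.55, "dog": 0.45}
agentic = lambda q: "dog"

fast = answer("what animal?", confident, agentic)
slow = answer("what animal?", unsure, agentic)
print(fast, slow)
```

The gate needs no ground-truth labels: it decides from the small model's own output distribution, which is the property the abstract attributes to the cognitive gating mechanism.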

2. ✅ Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23013v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

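The paper proposes a memory-augmented inference framework in which a lightweight 8B-parameter model answers repetitive queries using retrieved conversational context. Without any additional training, it recovers 69% of the performance of a full-context 235B model while cutting effective cost by 96%, showing that for user-specific queries, access to relevant knowledge matters more than model scale.

Abstract translation (Chinese)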

生产级人工智能代理频繁接收高度重复的用户特定查询，其中高达47%的查询在语义上与历史交互相似，但每次查询通常仍需消耗相同的计算成本。我们认为，这种冗余可通过对话记忆加以利用，从而将重复处理从成本负担转化为效率优势。我们提出一种记忆增强推理框架，其中轻量级的80亿参数模型通过检索到的对话上下文，以低成本推理路径回答所有查询。该方法无需任何额外训练或标注数据，即可达到30.5%的F1分数，恢复了2350亿参数全上下文模型69%的性能，同时将有效成本降低96%。值得注意的是，无记忆机制的2350亿参数模型（13.7% F1）表现甚至低于独立的80亿参数模型（15.4% F1），这表明对于用户特定查询，获取相关知识比模型规模更为重要。我们进一步分析了路由机制与置信度的作用。在实际置信度阈值下，仅路由机制即可将96%的查询导向小模型，但因自信幻觉导致准确率较低（13.0% F1）。记忆机制并未显著改变路由决策，而是通过基于检索到的用户特定信息生成回答来提升正确性。随着对话记忆随时间累积，重复话题的覆盖率增加，进一步缩小了性能差距。我们在152个LoCoMo问题（基于Qwen3-8B/235B）和500个LongMemEval问题上进行评估。引入混合检索（BM25 + 余弦相似度）使性能额外提升7.7 F1，证明检索质量直接增强端到端系统性能。总体而言，我们的研究结果表明，在持久性AI代理中，记忆机制而非模型规模，是驱动准确性与效率提升的核心因素。

Abstract

Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.

Keywords: Memory Augmented Inference, AI Agents, Small Language Models, Retrieval-Augmented Generation, Conversational Memory, Routing, Efficiency, Persistent AI
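Hybrid retrieval over conversational memory (BM25 plus cosine similarity) can be sketched with the standard library alone. Several parts are assumptions made for illustration: cosine similarity is computed over bag-of-words counts rather than learned embeddings, BM25 scores are scaled by their maximum before mixing, and the mixing weight `alpha` and the memory snippets are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenised query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(term in t for t in tokenized)  # document frequency
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def cosine(a, b):
    """Cosine similarity over bag-of-words counts (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    """Rank document indices by a weighted mix of scaled BM25 and cosine."""
    bm25 = bm25_scores(query, docs)
    top = max(bm25) or 1.0
    mixed = [alpha * s / top + (1 - alpha) * cosine(query, d) for s, d in zip(bm25, docs)]
    return sorted(range(len(docs)), key=lambda i: mixed[i], reverse=True)

memory = [
    "user prefers vegetarian recipes for dinner",
    "user is training for a marathon in april",
    "user asked about python packaging last week",
]
ranking = hybrid_rank("which marathon is the user training for", memory)
print(ranking)
```

The top-ranked snippets would then be placed in the small model's prompt, so its answer is grounded in user-specific context rather than in parametric knowledge.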

3. ✅ Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

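The paper studies how RLVR (reinforcement learning with verifiable rewards) fine-tuning improves the reasoning of large language models through sparse token-level distributional changes, and it demonstrates the functional importance of those changes.

Abstract translation (Chinese)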

具有可验证奖励的强化学习（RLVR）显著提升了大语言模型（LLMs）的推理能力，然而这些改进背后的词元级机制尚不明确。本文围绕三项主要分析，对RLVR引发的分布效应进行了系统的实证研究：（1）基础模型与RL模型之间分布偏移的词元级表征；（2）通过交叉采样干预，探究词元级分布偏移对序列级推理性能的影响；（3）这些偏移在词元级的细粒度机制。研究发现，RL微调引发了高度稀疏且目标明确的改变，仅有很小一部分词元分布在基础策略与RL策略之间表现出有意义的差异。我们进一步通过分析词元熵、位置集中度以及概率质量的重新分配，刻画了这些偏移的结构与演变过程。为评估这些稀疏变化的功能重要性，我们进行了交叉采样实验，在基础模型与RL模型之间有选择地交换词元选择，并设置不同的干预预算。实验表明，仅将一小部分RL采样的词元插入基础模型生成的序列中，即可逐步恢复RL带来的性能提升；反之，在原本由RL生成的序列中注入少量基础模型的词元选择，则会导致性能下降至基础水平。这分离出了一小部分直接决定RLVR性能增益的词元级决策。最后，我们探索了优势信号（advantage signal）的差异加权变体作为一种诊断性干预手段，发现其能带来超越基线的改进。综合而言，我们的研究结果揭示了RLVR所诱导的分布变化，并为理解RLVR微调作为一种目标明确的优化过程提供了一个细粒度的词元级视角。

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR’s performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.

Keywords: RLVR, token-level analysis, distributional shifts, fine-tuning, large language models, reasoning, sparse changes, reinforcement learning
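Measuring how sparse the base-to-RL shift is can be sketched as a per-position divergence followed by a threshold. The choice of Jensen-Shannon divergence, the 0.05 threshold, and the toy distributions are assumptions for illustration; the abstract does not specify which divergence or cutoff the authors use.

```python
import numpy as np

def per_token_js(base_probs, rl_probs, eps=1e-12):
    """Jensen-Shannon divergence per position; inputs shaped (positions, vocab)."""
    p = np.asarray(base_probs, dtype=float) + eps
    q = np.asarray(rl_probs, dtype=float) + eps
    p /= p.sum(axis=1, keepdims=True)
    q /= q.sum(axis=1, keepdims=True)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=1)
    kl_qm = np.sum(q * np.log(q / m), axis=1)
    return 0.5 * (kl_pm + kl_qm)

# Toy next-token distributions over a 3-token vocabulary at 4 positions.
base = np.array([[0.7, 0.2, 0.1]] * 4)
rl = base.copy()
rl[2] = [0.1, 0.2, 0.7]  # only position 2 changes after RL fine-tuning

js = per_token_js(base, rl)
shifted_fraction = float(np.mean(js > 0.05))
print(js, shifted_fraction)
```

Positions flagged this way would be natural candidates for a cross-sampling intervention, in which token choices at only those positions are swapped between the two models.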

4. ✅ Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics

Authors: Shaoxin Zhong, Yuchen Su, Michael Witbrock Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22904v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

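The study addresses the difficulty of making policy interventions both adaptive and auditable in agent-based simulations. It proposes a three-layer framework that separates diagnosis from control: an LLM performs the diagnosis and deterministic rules perform the control. In an elderly care simulation, the approach outperforms end-to-end black-box LLM methods by 11.7% while remaining fully auditable.

Abstract translation (Chinese)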

缓解老年人孤独感需要兼具适应性与可审计性的政策干预。现有方法难以协调这两个目标：传统基于智能体的模型存在静态僵化问题，而直接使用大语言模型（LLM）控制器则缺乏必要的可追溯性。本研究提出一个三层框架，通过将诊断与控制分离来同时实现这两种特性。大语言模型严格作为诊断工具运行，用于评估群体状态并生成结构化风险报告，而具有明确边界的确定性公式则将这些评估转化为可追溯的参数更新。这种分离机制确保每项政策决策都可归因于可审查的规则，同时保持对突发需求的适应性响应。我们通过在老年照护模拟中设置的五个实验条件进行系统性消融实验，验证了该框架的有效性。结果表明，显式控制规则在保持完全可审计性的同时，其性能比端到端黑盒大语言模型方法提升11.7%，这证实透明度并不需要以牺牲自适应性能为代价。

Abstract

Mitigating elderly loneliness requires policy interventions that achieve both adaptability and auditability. Existing methods struggle to reconcile these objectives: traditional agent-based models suffer from static rigidity, while direct large language model (LLM) controllers lack essential traceability. This work proposes a three-layer framework that separates diagnosis from control to achieve both properties simultaneously. LLMs operate strictly as diagnostic instruments that assess population state and generate structured risk evaluations, while deterministic formulas with explicit bounds translate these assessments into traceable parameter updates. This separation ensures that every policy decision can be attributed to inspectable rules while maintaining adaptive response to emergent needs. We validate the framework through systematic ablation across five experimental conditions in elderly care simulation. Results demonstrate that explicit control rules outperform end-to-end black-box LLM approaches by 11.7% while preserving full auditability, confirming that transparency need not compromise adaptive performance.

Keywords: LLM-based diagnostics, agent-based simulations, policy adaptation, auditability, elderly care, three-layer framework, traceable parameter updates, adaptive response

5. ✅ MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0
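The total above can be reproduced from the table. Each row contributes weight × relevance, and the pass mark follows the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The function names below are hypothetical; this is a sketch of the arithmetic, not the report generator's code.

```python
def total_score(rows):
    """Sum weight * relevance over the scored keyword rows."""
    return sum(weight * relevance for weight, relevance in rows)


def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration."""
    return base + factor * total_weight


# Non-zero rows of the MedObvious table as (weight, relevance);
# the remaining 23 keywords contribute 0.
nonzero_rows = [(1.0, 5.0), (1.0, 10.0), (1.0, 5.0), (1.0, 10.0)]

score = total_score(nonzero_rows)
threshold = pass_threshold(total_weight=27.0)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} -> {verdict}")  # prints: 30.0 / 26.6 -> pass
```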

!!! tip deepseek-chat TL;DR

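This paper shows that medical vision-language models are weak at input validation in clinical triage tasks: even when the medical image input is inconsistent or invalid, a model may still produce a plausible-sounding but wrong diagnostic narrative, which is a critical safety risk.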

Abstract (Chinese translation)

视觉语言模型（VLMs）越来越多地应用于医学报告生成和视觉问答等任务。然而，流畅的诊断文本并不能保证安全的视觉理解。在临床实践中，解读始于预诊断合理性检查：验证输入是否有效可读（正确的模态和解剖结构、合理的视角与方位，且无明显完整性违规）。现有基准大多假设此步骤已解决，因此忽略了一个关键失效模式：即使输入不一致或无效，模型仍可能生成看似合理的叙述。我们提出了MedObvious，一个包含1,880项任务的基准，它将输入验证作为一种在小规模多面板图像集上的集合层面一致性能力进行独立评估：模型必须识别是否存在任何面板违反预期一致性。MedObvious涵盖五个渐进层级，从基本方位/模态不匹配到临床驱动的解剖结构/视角验证及分诊式线索，并包含五种评估格式以测试不同界面的鲁棒性。通过对17种不同VLM的评估，我们发现合理性检查仍不可靠：部分模型在正常（阴性对照）输入上产生异常幻觉；当扩展至更大图像集时性能下降；且多项选择与开放式设置下的测量准确率存在显著差异。这些结果表明，预诊断验证对于医学VLM而言仍未解决，应在部署前将其视为一项独立且安全关键的能力加以对待。

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
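One way to make "hallucinating anomalies on negative-control inputs" measurable is to score set-level predictions separately on clean image sets and on sets that contain a violation. The sketch below assumes a hypothetical data layout (one boolean pair per image set) and is not the benchmark's actual scoring code.

```python
def sanity_check_rates(examples):
    """examples: iterable of (has_violation, predicted_violation) booleans,
    one pair per multi-panel image set.
    Returns (false_alarm_rate, detection_rate)."""
    false_alarms = negatives = detections = positives = 0
    for has_violation, predicted in examples:
        if has_violation:
            positives += 1
            detections += int(predicted)
        else:
            negatives += 1
            false_alarms += int(predicted)
    false_alarm_rate = false_alarms / negatives if negatives else 0.0
    detection_rate = detections / positives if positives else 0.0
    return false_alarm_rate, detection_rate


# Toy predictions, made up for illustration (not results from the paper).
toy = [(False, False), (False, True), (True, True),
       (True, False), (True, True), (False, False)]
far, det = sanity_check_rates(toy)
print(f"false-alarm rate={far:.2f}, detection rate={det:.2f}")
```

Reporting the false-alarm rate on its own keeps a model from looking good merely by flagging every input as invalid.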

Keywords: Vision Language Models, Medical AI, Clinical Triage, Input Validation, Hallucination, Benchmark, Safety, Medical Imaging

6. ✅ Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts

Authors: Maida Aizaz, Quang Minh Nguyen
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.22837v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

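This study analyses how five large language models generate personas for Palestinian and Israeli identities against the backdrop of the Israeli-Palestinian conflict. The models produce different distributions of socioeconomic attributes in war versus non-war contexts, and fairness instructions shift the output distributions, but the fairness concepts mentioned in the models' reasoning do not translate directly into consistently fair outputs.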

Abstract (Chinese translation)

大型语言模型（LLMs）正日益广泛地应用于社会模拟和角色生成，这要求我们理解它们如何表征地缘政治身份。本文通过640种实验条件（涵盖战争与非战争背景，并分配不同角色），分析了五种主流LLM为巴勒斯坦和以色列身份生成的角色特征。我们观察到生成属性中存在显著的分布模式：在战争背景下，巴勒斯坦角色常与较低的社会经济地位和生存导向型角色相关联，而以色列角色则大多保持中产阶级地位和专业职业属性。当明确提示模型避免有害假设时，模型展现出多样化的分布变化，例如非二元性别推断显著增加，或职业角色趋于泛化（如“学生”），但潜在的社会经济差异往往依然存在。此外，对推理轨迹的分析揭示了模型推理与生成之间的有趣动态：尽管推理过程始终提及与公平相关的概念，但最终生成的角色仍遵循上述多样化的分布变化。这些发现描绘了模型如何解读地缘政治背景，同时表明它们以不同方式处理公平性并进行调整；公平概念并未被一致且直接地转化为具有代表性的生成结果。

Abstract

Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., “student”), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamics between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.

Keywords: Large Language Models, Persona Generation, Geopolitical Contexts, Fairness Interpretation, Reasoning Traces, Social Simulation, Distributional Patterns, Model Behavior Analysis

7. ✅ Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Authors: Anupam Pani, Yanchao Yang
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.23202v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

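This study addresses the lack of a mechanism for active visual attention allocation in vision-language-action models for robotic manipulation. It introduces a gaze-regularized training framework that aligns the model's internal attention with human visual patterns, yielding 4-12% improvements across several manipulation benchmarks while also improving interpretability and training efficiency.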

Abstract (Chinese translation)

尽管视觉-语言-动作（VLA）模型已取得进展，但机器人操作在细粒度任务上仍面临挑战，因为现有模型缺乏主动视觉注意力分配的机制。人类注视天然编码了意图、规划与执行模式——为引导机器人感知提供了强大的监督信号。我们提出一种注视正则化训练框架，该框架在不修改架构或增加推理开销的前提下，将VLA模型的内部注意力与人类视觉模式对齐。我们的方法将时间聚合的注视热图转化为图像块级别的分布，并通过KL散度对Transformer的注意力进行正则化，从而构建出对任务相关特征的归纳偏置，同时保持部署效率。当集成到现有VLA架构中时，我们的方法在多个操作基准测试中实现了4-12%的性能提升。注视正则化模型能以更少的训练步骤达到同等性能，并在光照变化和传感器噪声下保持鲁棒性。除性能指标外，学习到的注意力模式可生成可解释的可视化结果，这些结果反映了人类策略，增强了机器人系统的可信度。此外，我们的框架无需眼动追踪设备，可直接应用于现有数据集。这些结果表明，人类感知先验能显著加速机器人学习，同时提升任务性能和系统可解释性。

Abstract

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models’ internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer’s attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
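The regularizer described in the abstract (gaze heatmap to patch-level distribution, then a KL term against attention) can be illustrated with a small pure-Python sketch. The pooling scheme, the epsilon smoothing, the KL direction KL(gaze ‖ attention), and the uniform stand-in for model attention are all assumptions made here; the abstract does not specify them.

```python
import math


def pool_to_patches(heatmap, patch):
    """Sum a 2D gaze heatmap (list of rows) over non-overlapping
    patch x patch cells, returning a flat list in row-major order."""
    rows, cols = len(heatmap), len(heatmap[0])
    pooled = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            pooled.append(sum(heatmap[i][j]
                              for i in range(r, min(r + patch, rows))
                              for j in range(c, min(c + patch, cols))))
    return pooled


def normalize(values, eps=1e-8):
    """Turn non-negative scores into a probability distribution.
    eps keeps every entry strictly positive so the log below is defined."""
    shifted = [v + eps for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy 4x4 gaze heatmap pooled into 2x2 patches, giving 4 patch weights.
gaze_heatmap = [
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
gaze = normalize(pool_to_patches(gaze_heatmap, patch=2))
attention = normalize([0.25, 0.25, 0.25, 0.25])  # stand-in for model attention

reg_loss = kl_divergence(gaze, attention)
print(f"gaze regularizer = {reg_loss:.4f}")
```

In training, a term like `reg_loss` would presumably be added to the task loss with a weighting coefficient. Since it only affects the loss, it is consistent with the abstract's claim of no architectural change and no inference-time overhead.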

8. ❌ SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Authors: Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.23386v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

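Scoring rationale: SIMART proposes a unified framework based on an MLLM (multimodal large language model) that decomposes monolithic meshes into articulated assets usable in simulation. This is a direct application of large models to science and engineering (AI for Science), so that keyword scores 10. The paper's core innovation is a Sparse 3D VQ-VAE that reduces token counts; this is loosely related to Sparse Models but is not MoE technology as such, so that keyword scores 5. LLMs and Foundation Models are also highly relevant, since MLLMs fall within the large-model category. The paper does not cover SLMs, training methods, inference optimization, agents, quantization, or the other techniques, so those keywords score 0.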

!!! tip deepseek-chat TL;DR

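This paper addresses the difficulty of generating high-quality articulated assets usable in physical simulation from static 3D meshes. It proposes SIMART, a unified MLLM framework that uses a Sparse 3D VQ-VAE to cut token counts substantially (by 70% relative to dense voxel tokens, according to the abstract), achieving state-of-the-art performance and supporting physics-based robotic simulation.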

Abstract (Chinese translation)

高质量可动三维资产对于具身人工智能与物理仿真至关重要，然而当前三维生成技术仍集中于静态网格模型，导致“仿真就绪”的交互式对象存在缺口。现有的大多数可动物体创建方法依赖于多阶段流程，各解耦模块间的误差会逐级累积。与之相对，统一的多模态大语言模型为联合静态资产理解与仿真就绪资产生成提供了单阶段路径。但基于稠密体素的三维标记化方法会产生冗长的三维标记序列和高内存开销，限制了其向复杂可动物体的扩展能力。为此，我们提出SIMART——一个统一的多模态大语言模型框架，能够同步执行部件级分解与运动学预测。通过引入稀疏三维VQ-VAE（向量量化变分自编码器），SIMART相比稠密体素标记将标记数量减少了70%，从而实现了高保真度的多部件装配。SIMART在PartNet-Mobility数据集和开放域AIGC数据集上取得了最先进的性能，并支持基于物理的机器人仿真。

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in “sim-ready” interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.

Keywords: articulated 3D assets, MLLM, sparse 3D VQ-VAE, part-level decomposition, kinematic prediction, physics-based simulation, PartNet-Mobility, AIGC datasets

9. ❌ ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

Authors: Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.22911v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

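Scoring rationale: The paper focuses on visual token compression for video multimodal large language models (video MLLMs); its core contribution is ForestPrune, a training-free token pruning method. It is highly relevant to large language models (LLMs), because MLLMs are multimodal extensions of LLMs and the paper explicitly applies the method to models such as LLaVA-Video and LLaVA-OneVision. It is moderately related to model compression (e.g., quantization) and inference acceleration, because token compression aims to reduce computation and memory overhead and so belongs to model-efficiency optimization. The paper does not touch the other keywords: MoE, small models, scaling laws, training methods (pre-training, fine-tuning, alignment), RAG, context extension, attention optimization, reasoning techniques (CoT, MCTS), agents, tool use, multi-agent systems, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications.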

!!! tip deepseek-chat TL;DR

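This paper targets the shortfall of existing methods at high-ratio visual token compression in video multimodal large language models (MLLMs). It proposes ForestPrune, a training-free pruning method based on spatial-temporal forest modeling, which compresses tokens effectively on models such as LLaVA-Video and LLaVA-OneVision (for example, retaining 95.8% average accuracy while removing 90% of tokens on LLaVA-OneVision) and outperforms the compared token compression methods.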

Abstract (Chinese translation)

由于在计算与内存开销方面具有显著优势，令牌压缩已成为多模态大语言模型的研究热点，并在图像-语言任务中取得了显著进展。然而，在视频领域，现有方法仍难以实现高比例令牌压缩。我们将此不足归因于对视频时序性与连续性内容建模的不足，并提出一种新颖且无需训练的面向视频多模态大语言模型的令牌剪枝方法——ForestPrune。该方法通过时空森林建模实现高效的高比例剪枝。在实践中，ForestPrune基于语义、空间与时序约束跨视频帧构建令牌森林，从而实现对视频内容的整体理解。随后，ForestPrune依据树深度与节点角色评估令牌树及节点的重要性，进而获得全局最优剪枝决策。为验证ForestPrune，我们将其应用于两个代表性视频多模态大语言模型（LLaVA-Video和LLaVA-OneVision），并在多个视频基准测试上进行了广泛实验。实验结果不仅证明了该方法对视频多模态大语言模型的高度有效性（例如在LLaVA-OneVision上减少90%令牌的同时仍保持95.8%的平均准确率），还显示出其相较于现有令牌压缩方法的卓越性能与效率（如在LLaVA-Video上于MLVU数据集准确率提升10.1%，且剪枝时间较FrameFusion减少81.4%）。

Abstract

Due to the great saving of computation and memory overhead, token compression has become a research hot-spot for MLLMs and achieved remarkable progress in image-language tasks. However, for the video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel and training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune construct token forests across video frames based on the semantic, spatial and temporal constraints, making an overall comprehension of videos. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a bunch of video benchmarks. The experimental results not only show the great effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing 90% tokens for LLaVA-OneVision, but also show its superior performance and efficiency than the compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time than FrameFusion on LLaVA-Video.

Keywords: token compression, video MLLMs, spatial-temporal modeling, pruning, inference efficiency, multimodal large language models, training-free method

10. ❌ Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

Authors: Mincheol Kwon, Minseung Lee, Seonga Choi, Miso Choi, Kyeong-Jin Oh, Hyunyoung Lee, Cheonyoung Park, Yongho Song, Seunghyun Park, Jinkyu Kim
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.22815v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

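Scoring rationale: The paper studies the efficiency of large vision-language models (LVLMs) on information-rich images and proposes the PinPoint framework, which reduces the number of visual tokens by identifying instruction-relevant regions. It is highly relevant to "Large Language Models" (the paper explicitly uses LLMs as the basis of LVLMs), scoring 10; moderately related to "Instruction Tuning" (it involves instruction alignment), scoring 5; and moderately related to "Speculative Decoding" (it involves inference acceleration), scoring 5. Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are not covered in the paper and score 0.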

!!! tip deepseek-chat TL;DR

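This paper addresses the high computational cost that large vision-language models incur on information-rich images. It proposes the PinPoint framework, which identifies instruction-relevant image regions to reduce the number of visual tokens, improving accuracy on VQA tasks while significantly lowering computational overhead.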

Abstract (Chinese translation)

大型视觉语言模型（LVLMs）通过利用大语言模型（LLMs）的推理能力，在各种多模态任务中展现出强大性能。然而，处理视觉复杂且信息密集的图像（例如信息图或文档版面）时，这些模型需要生成大量视觉标记，导致显著的计算开销。为解决此问题，我们提出了PinPoint，一种新颖的两阶段框架：首先识别与指令相关的图像区域，随后对其进行细化以提取细粒度视觉特征，从而提升推理能力与效率。我们方法的核心是指令-区域对齐，它利用视觉输入和文本指令共同定位相关区域。我们进一步引入了新的标注数据，为具有挑战性的视觉问答基准——InfographicVQA、MultiPageDocVQA和SinglePageDocVQA——中与指令相关的区域提供了更丰富的真实监督信息。实验结果表明，PinPoint不仅相比现有方法实现了更高的准确率，还通过最小化无关视觉标记显著降低了计算开销。

Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

Keywords: Large Vision-Language Models, Instruction-Relevant Regions, Visual Tokens, Computational Overhead, PinPoint Framework, Instruction-Region Alignment, VQA Benchmarks, InfographicVQA

Authors: Hanjing Wang, S. Mostafa Mousavi, Patrick Robertson, Richard M. Allen, Alexie Barski, Robert Bosch, Nivetha Thiruverahan, Youngmin Cho, Tajinder Gadh, Steve Malkos, Boone Spooner, Greg Wimpey, Marc Stogaitis
Journal/Source: arXiv
Published: 2026-03-24
arXiv link: http://arxiv.org/abs/2603.23322v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

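Scoring rationale: The paper explicitly uses LLMs to analyze social media data in order to understand how users perceive an earthquake early-warning system, so it is highly relevant to "Large Language Models" (10 points). It is an application of large models to science (seismology and social science), so it is moderately related to "AI for Science" (5 points). The remaining keywords concern specific technical mechanisms (such as MoE, Scaling Laws, and RLHF) or particular application areas (such as bioinformatics) that the paper does not address, so they all score 0.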

!!! tip deepseek-chat TL;DR

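This study uses large language models to analyze social media data and assess how users perceived the Android Earthquake Alert system during an earthquake in Türkiye. It finds that user trust is strongly correlated with alert timeliness and that users treat timeliness as the core indicator of accuracy.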

Abstract (Chinese translation)

2025年4月23日土耳其马尔马拉埃雷利西Mw 6.2级地震期间，安卓地震预警（Android’s Earthquake Alert, AEA）系统为数百万用户提供了及时的早期预警。此次事件是25年来该地区发生的最大地震，为基于智能手机的地震早期预警（Earthquake Early Warning, EEW）系统提供了一次关键的现实世界测试。AEA系统以高精度成功向用户发出警报，在最强震动抵达城市区域前提供了超过一分钟的预警时间。本研究利用大语言模型（Large Language Models, LLMs）分析了来自X平台的500余条公开社交媒体帖子，提取出与用户体验和行为相关的42项独立属性。统计分析揭示了显著的相关性，特别是用户信任度与警报及时性之间存在强关联。我们的结果表明，工程定义与以用户为中心的系统准确性定义存在差异。研究发现，在用户认知中，及时性即等同于准确性。总体而言，本研究为优化警报设计、公众教育活动及未来行为研究提供了可操作的见解，以提升此类系统在地震活跃区域的有效性。

Abstract

Android’s Earthquake Alert (AEA) system provided timely early warnings to millions during the Mw 6.2 Marmara Ereglisi, Türkiye earthquake on April 23, 2025. This event, the largest in the region in 25 years, served as a critical real-world test for smartphone-based Earthquake Early Warning (EEW) systems. The AEA system successfully delivered alerts to users with high precision, offering over a minute of warning before the strongest shaking reached urban areas. This study leveraged Large Language Models (LLMs) to analyze more than 500 public social media posts from the X platform, extracting 42 distinct attributes related to user experience and behavior. Statistical analyses revealed significant relationships, notably a strong correlation between user trust and alert timeliness. Our results indicate a distinction between engineering and the user-centric definition of system accuracy. We found that timeliness is accuracy in the user’s mind. Overall, this study provides actionable insights for optimizing alert design, public education campaigns, and future behavioral research to improve the effectiveness of such systems in seismically active regions.

Keywords: Large Language Models, Earthquake Early Warning, Social Media Analysis, User Perception, Alert Timeliness, Android Earthquake Alert, User Trust, Behavioral Research

12. ❌ Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length

Authors: Jingxuan Chen, Mohammad Taher Pilehvar, Jose Camacho-Collados
Journal/Source: arXiv
Published: 2026-03-23
arXiv link: http://arxiv.org/abs/2603.22608v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

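Scoring rationale: The paper studies performance degradation of LLMs in multi-instance processing (MIP), a direct evaluation of core LLM capability, so it is highly relevant to “Large Language Models” (10 points). It also analyzes how context length affects performance, which relates to “Context Window Extension” (5 points), although the focus is the degradation phenomenon rather than techniques for extending the context window. The remaining keywords cover specific techniques (MoE, quantization, inference acceleration), training methods (pre-training, fine-tuning), application areas (AI for Science), or advanced capabilities (reasoning, agents), none of which the paper addresses, so they score 0.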
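The total in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the total is the sum of weight × relevance over all keywords and that the pass mark follows the report's stated formula (5.0 + 0.8 × total weight). The function names are illustrative and are not taken from the report generator.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration (assumed formula)."""
    return base + factor * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in rows)


# Paper 12: 27 keywords with weight 1.0; only two have non-zero relevance.
rows = [(1.0, 10.0), (1.0, 5.0)] + [(1.0, 0.0)] * 25
threshold = pass_mark(sum(weight for weight, _ in rows))
score = total_score(rows)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} {verdict}")
```

With 27 keywords of weight 1.0, the pass mark is 26.6, so a paper that scores on only two keywords cannot pass.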

!!! tip deepseek-chat TL;DR

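The paper studies how LLM performance degrades on multi-instance inputs. All evaluated LLMs show slight degradation at small instance counts and collapse at larger counts, and instance count affects the final result more strongly than context length does.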

Abstract (Chinese translation)

用户在处理多份文档或对多个实例进行分析时，常依赖大语言模型（Large Language Models, LLMs）。例如，分析一系列电影评论的整体情感需要大语言模型分别处理每一条评论的情感，以提供最终的聚合答案。尽管大语言模型在此类单项任务上表现通常优异，但关于其处理多实例输入时的表现研究尚少。本文针对大语言模型在个体表现优秀的任务中，对其多实例处理（Multi-Instance Processing, MIP）能力进行了全面评估。结果显示，所有大语言模型在处理少量实例（约20-100个）时均呈现轻微的性能下降趋势，而在处理更大数量实例时则出现性能崩溃。关键的是，我们的分析表明，虽然上下文长度与此性能下降相关，但实例数量对最终结果的影响更为显著。这一发现提示，在为大语言模型优化多实例处理性能时，应同时关注上下文长度，尤其是实例数量。

Abstract

Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.
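The movie-review example in the abstract can be made concrete with a small prompt builder. This is a hypothetical sketch: the prompt wording, the function names, and the counting task are assumptions for illustration, not the paper's evaluation protocol.

```python
def build_mip_prompt(reviews, question="How many of the reviews below are positive?"):
    """Pack several instances into one prompt (illustrative wording)."""
    lines = [question, ""]
    lines += [f"Review {i}: {text}" for i, text in enumerate(reviews, start=1)]
    lines += ["", "Answer with a single integer."]
    return "\n".join(lines)


def gold_count(labels):
    """Reference answer for the aggregated question."""
    return sum(1 for label in labels if label == "positive")


reviews = ["Loved it.", "Too long and dull.", "A pleasant surprise."]
labels = ["positive", "negative", "positive"]
prompt = build_mip_prompt(reviews)
answer = gold_count(labels)
```

Varying the number of reviews while padding or trimming each review would let instance count and context length be changed separately, which is the distinction the abstract draws.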

Keywords: Large Language Models, LLMs, multi-instance processing, performance degradation, context length, instance count, evaluation

13. ❌ TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

Authors: Hyunwoo Oh, SungHeon Jeong, Suyeon Jang, Hanning Chen, Sanggeon Yun, Tamoghno Das, Mohsen Imani Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22855v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

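Scoring rationale: TorR targets efficient edge deployment of task-oriented object detection (TOOD), a computer-vision task. Its core is an algorithm-architecture co-design that replaces CLIP with a hyperdimensional computing (HDC) associative reasoner and optimizes caching and compute paths for real-time, low-energy operation. Most keywords do not apply, because they concern the principles, training methods, inference optimization, alignment, and agent systems of large language models (LLMs). Only three keywords are weakly related: (1) “Small Language Models” OR “SLMs” OR “On-device AI” (5 points): the paper addresses on-device deployment, but for a vision task rather than a language model. (2) “Quantization” OR “Model Compression” OR “Low-bit Weights” (5 points): precision gating and bit-sliced memory fall under model compression and efficient computation, but they are not the paper's core. (3) “Speculative Decoding” OR “Inference Acceleration” (5 points): the paper reduces inference latency and energy, which counts as inference acceleration, but its methods (caching, path scheduling) are specific to this design and differ from typical LLM decoding acceleration. Other keywords, such as LLMs, MoE, Scaling Laws, training methods, alignment, RAG, CoT, and agents, are not addressed.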

!!! tip deepseek-chat TL;DR

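The paper proposes TorR, a brain-inspired algorithm-architecture co-design that uses a hyperdimensional-computing associative reasoner and cache optimization to achieve efficient, real-time task-oriented object detection on edge devices, cutting energy use substantially while keeping competitive accuracy.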

Abstract (Chinese translation)

基于CLIP的任务导向物体检测（TOOD）具备开放词汇和提示驱动的语义理解能力，但其密集的逐窗口计算与高内存流量阻碍了在实时、功耗受限的边缘设备上的部署。本文提出TorR，一种受脑启发的算法-架构协同设计方案，其利用超维（HDC）关联推理器替代CLIP风格的密集对齐机制，并将时间一致性转化为计算重用。在算法层面，TorR将对齐问题重构为超维相似度计算与图组合，并通过以下方式引入部分相似度重用：（i）采用查询缓存与逐类别分数累积；（ii）当仅少量超向量位发生变化时执行精确的$\delta$更新；（iii）在高系统负载下启用基于相似度/负载的门控旁路。在架构层面，TorR实现了一个支持车道缩放、位切片化的项目存储器（item memory），配备存储体/精度门控机制，以及一个轻量级控制器，该控制器根据物体数量动态调度旁路/$\delta$/完整计算路径，以满足RT-30/RT-60的实时性目标。基于TSMC 28纳米工艺综合并通过周期精确模拟器验证，TorR能够维持实时吞吐量，每窗口能耗在毫焦耳级别（60帧/秒时约50毫焦；30帧/秒时约113毫焦），且延迟抖动较低；同时在五项任务提示（task prompts）上实现了具有竞争力的平均精度AP@0.5（均值44.27%），与强大的视觉语言模型（VLM）基线相比性能差距有限，但能耗降低数个数量级。该设计提供了部署时可配置参数（有效维度$D'$、阈值、精度），以便根据边缘设备的资源预算在精度、延迟和能耗之间进行权衡。

Abstract

Task-oriented object detection (TOOD) atop CLIP offers open-vocabulary, prompt-driven semantics, yet dense per-window computation and heavy memory traffic hinder real-time, power-limited edge deployment. We present TorR, a brain-inspired algorithm–architecture co-design that replaces CLIP-style dense alignment with a hyperdimensional (HDC) associative reasoner and turns temporal coherence into reuse. On the algorithm side, TorR reformulates alignment as HDC similarity and graph composition, introducing partial-similarity reuse via (i) query caching with per-class score accumulation, (ii) exact $\delta$-updates when only a small set of hypervector bits change, and (iii) similarity/load-gated bypass under high system load. On the architecture side, TorR instantiates a lane-scalable, bit-sliced item memory with bank/precision gating and a lightweight controller that schedules bypass/$\delta$/full paths to meet RT-30/RT-60 targets as object counts vary. Synthesized in a TSMC 28 nm process and exercised with a cycle-accurate simulator, TorR sustains real-time throughput with millijoule-scale energy per window ($\approx$50 mJ at 60 FPS; $\approx$113 mJ at 30 FPS) and low latency jitter, while delivering competitive AP@0.5 across five task prompts (mean 44.27%) within a bounded margin to strong VLM baselines, but at orders-of-magnitude lower energy. The design exposes deployment-time configurability (effective dimension $D'$, thresholds, precision) to trade accuracy, latency, and energy for edge budgets.
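The idea of an exact update when only a few hypervector bits change can be illustrated in plain Python. This is a sketch under stated assumptions: hypervectors are bipolar (±1) and similarity is a dot product. The abstract does not specify the encoding or the similarity measure, and the names `dot` and `delta_update` are hypothetical.

```python
import random


def dot(a, b):
    """Full similarity: dot product of two bipolar hypervectors."""
    return sum(x * y for x, y in zip(a, b))


def delta_update(old_sim, old_query, item, flipped):
    """Exact incremental similarity after the query flips the given bits.

    Flipping bit i changes the dot product by -2 * old_query[i] * item[i],
    so only the flipped positions need to be touched.
    """
    return old_sim - 2 * sum(old_query[i] * item[i] for i in flipped)


rng = random.Random(0)
D = 1024  # hypervector dimension (illustrative value)
item = [rng.choice((-1, 1)) for _ in range(D)]
query = [rng.choice((-1, 1)) for _ in range(D)]
cached = dot(query, item)  # similarity cached from the previous frame

flipped = rng.sample(range(D), 8)  # only 8 of 1024 bits change
new_query = list(query)
for i in flipped:
    new_query[i] = -new_query[i]

fast = delta_update(cached, query, item, flipped)
full = dot(new_query, item)
assert fast == full  # the update is exact, not an approximation
```

The incremental path touches 8 positions instead of 1024, which is the kind of reuse that temporal coherence between frames makes possible.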

Keywords: Task-oriented object detection, Algorithm-architecture co-design, Hyperdimensional computing, Edge deployment, Cache optimization, Real-time inference, Energy efficiency, CLIP alternative

Authors: Chaoqun Cui, Caiyan Jia Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22854v1

Score: 11.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	3.0/10	3.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

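Scoring rationale: The paper proposes a Transformer-based pre-training method (P2T3) for rumor detection on social media, so it mainly involves pre-training and Transformer models, which relate to LLMs. The abstract explicitly states that the method “pre-trains on large-scale unlabeled datasets”, so it is highly relevant to “Pre-training” (8 points). The paper does not mention LLMs explicitly, but the Transformer is their core component and the abstract says the method “potentially offers a large model”, so it has some relevance to “Large Language Models” (3 points). Other keywords, such as MoE, SFT, RLHF, and RAG, are not addressed and score 0. The paper is an AI application (social media analysis) but does not touch bioinformatics or another specific scientific field, so “AI for Science” scores 0.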

!!! tip deepseek-chat TL;DR

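To address the over-smoothing and long-range-dependency problems that graph neural networks face in social media rumor detection, the paper proposes the Pre-Trained Propagation Tree Transformer (P2T3), a pure-Transformer method. Experiments show that it surpasses previous state-of-the-art methods on multiple benchmark datasets and performs well under few-shot conditions.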

Abstract (Chinese translation)

基于深度学习的谣言检测技术通常利用图神经网络（GNN）来分析帖子关系。然而，这些方法在处理谣言传播结构时，因过度平滑问题而表现不佳，导致性能下降。我们对这一问题的研究发现，过度平滑本质上与谣言传播树的结构特征相关，其中大多数节点为单层节点。此外，图神经网络难以捕捉这些树结构中的长程依赖关系。为应对这些挑战，我们提出了一种基于纯Transformer架构的预训练传播树Transformer（P2T3）方法。该方法沿回复的传播方向从树结构中提取所有对话链，利用词元级嵌入注入连接信息并引入必要的归纳偏置，并在大规模无标注数据集上进行预训练。实验表明，P2T3在多个基准数据集上超越了以往的最优方法，并在少样本条件下表现良好。P2T3不仅避免了图神经网络固有的过度平滑问题，还可能为未来社交媒体研究提供大型模型或统一的多模态方案。

Abstract

Deep learning techniques for rumor detection typically utilize Graph Neural Networks (GNNs) to analyze post relations. These methods, however, falter due to over-smoothing issues when processing rumor propagation structures, leading to declining performance. Our investigation into this issue reveals that over-smoothing is intrinsically tied to the structural characteristics of rumor propagation trees, in which the majority of nodes are 1-level nodes. Furthermore, GNNs struggle to capture long-range dependencies within these trees. To circumvent these challenges, we propose a Pre-Trained Propagation Tree Transformer (P2T3) method based on pure Transformer architecture. It extracts all conversation chains from a tree structure following the propagation direction of replies, utilizes token-wise embedding to infuse connection information and introduces necessary inductive bias, and pre-trains on large-scale unlabeled datasets. Experiments indicate that P2T3 surpasses previous state-of-the-art methods in multiple benchmark datasets and performs well under few-shot conditions. P2T3 not only avoids the over-smoothing issue inherent in GNNs but also potentially offers a large model or unified multi-modal scheme for future social media research.

Keywords: rumor detection, propagation tree, Transformer, pre-training, over-smoothing, social media, few-shot learning, graph neural networks

15. ❌ Permutation-Symmetrized Diffusion for Unconditional Molecular Generation

Authors: Gyeonghoon Ko, Juho Lee Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23255v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

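Scoring rationale: The paper focuses on diffusion models for molecular point-cloud generation and proposes modeling diffusion directly on a quotient manifold to handle permutation symmetry. Of all the keywords, only “AI for Science” OR “Bioinformatics” OR “Cheminformatics” is highly relevant (10 points), because the paper applies AI to science, specifically cheminformatics and molecular generation. The other keywords concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents), whereas this paper studies diffusion models for molecular generation and involves neither language models nor those LLM-specific techniques, so they score 0.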

!!! tip deepseek-chat TL;DR

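The paper proposes a permutation-symmetrized diffusion model on a quotient manifold for unconditional 3D molecule generation, achieving competitive generation quality with improved efficiency on the QM9 dataset.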

Abstract (Chinese translation)

置换不变性是分子点云生成的基础，然而大多数扩散模型通过在有序空间上使用置换等变网络间接实现这一性质。我们提出直接在商流形 $\tilde{\mathcal{X}}=\mathbb{R}^{d\times N}/S_N$ 上建模扩散过程，其中所有原子置换均被视为同一元素。我们证明了 $\tilde{\mathcal{X}}$ 上的热核具有显式表达式，即所有置换下欧几里得热核的求和，这阐明了商空间上的扩散与有序粒子扩散的本质差异。训练过程需要涉及 $S_N$ 上难以处理的求和项的置换对称化分数；我们推导了基于置换后验的期望形式，并利用置换空间中的马尔可夫链蒙特卡洛方法进行近似。我们在 EQGAT-Diff 协议下，使用 SemlaFlow 风格的主干网络并连续处理所有变量，对 QM9 数据集的无条件三维分子生成任务进行了评估。结果表明，基于商空间的置换对称化方法具有实用性，能够在提升效率的同时生成具有竞争力的分子结构。

Abstract

Permutation invariance is fundamental in molecular point-cloud generation, yet most diffusion models enforce it indirectly via permutation-equivariant networks on an ordered space. We propose to model diffusion directly on the quotient manifold $\tilde{\mathcal{X}}=\mathbb{R}^{d\times N}/S_N$, where all atom permutations are identified. We show that the heat kernel on $\tilde{\mathcal{X}}$ admits an explicit expression as a sum of Euclidean heat kernels over permutations, which clarifies how diffusion on the quotient differs from ordered-particle diffusion. Training requires a permutation-symmetrized score involving an intractable sum over $S_N$; we derive an expectation form over a posterior on permutations and approximate it using MCMC in permutation space. We evaluate on unconditional 3D molecule generation on QM9 under the EQGAT-Diff protocol, using SemlaFlow-style backbone and treating all variables continuously. The results demonstrate that quotient-based permutation symmetrization is practical and yields competitive generation quality with improved efficiency.
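The two central statements of the abstract can be written out as a short derivation. This is a sketch under assumed conventions: the Euclidean heat kernel solves $\partial_t p = \Delta p$, and a permutation $\sigma$ acts on $y \in \mathbb{R}^{d\times N}$ by reordering its $N$ columns. The paper's normalization, noise schedule, and notation may differ. The symmetrized kernel is a sum of Euclidean kernels over all permutations:

$$
\tilde{p}_t([x],[y]) = \sum_{\sigma \in S_N} p_t(x, \sigma\cdot y),
\qquad
p_t(x,y) = (4\pi t)^{-dN/2} \exp\!\left(-\frac{\lVert x-y\rVert^2}{4t}\right).
$$

Differentiating the logarithm of this sum gives the score as a weighted average of per-permutation scores, where the weights form a probability distribution over permutations:

$$
\nabla_x \log \tilde{p}_t([x],[y])
= \sum_{\sigma \in S_N} w_\sigma(x,y)\, \nabla_x \log p_t(x, \sigma\cdot y),
\qquad
w_\sigma(x,y) = \frac{p_t(x, \sigma\cdot y)}{\sum_{\tau \in S_N} p_t(x, \tau\cdot y)}.
$$

The sum has $N!$ terms, which is why it is intractable to evaluate directly. Because the right-hand side is an expectation under $w_\sigma$, it can be estimated from samples of $\sigma$, which is consistent with the abstract's use of MCMC in permutation space.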

Keywords: diffusion models, molecular generation, permutation invariance, quotient manifold, 3D molecules, QM9, score estimation, MCMC

16. ❌ PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

Authors: Zining Fang, Cheng Xue, Chunhui Liu, Bin Xu, Ming Chen, Xiaowei Hu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22844v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

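Scoring rationale: The paper belongs to computer vision and medical image processing. It proposes PhySe-RPO, a diffusion-based desmoking method for surgical video. Its core techniques are diffusion models, reinforcement-learning optimization (Relative Policy Optimization), and computer-vision components (a CLIP-based semantic reward); it involves no large language model (LLM) or related technique. The keywords all relate directly to LLMs, their training methods, optimization, reasoning, agent systems, or general AI techniques, whereas this paper studies a domain-specific vision task, so every keyword except the last (“AI for Science”) is irrelevant. “AI for Science” scores 5 because surgical smoke removal can be seen as an application of AI to medical science (surgery), although the paper does not specifically address “Bioinformatics” or “Cheminformatics”.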

!!! tip deepseek-chat TL;DR

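The paper proposes PhySe-RPO, a diffusion framework that uses physics- and semantics-guided relative policy optimization to remove smoke from surgical video, producing physically consistent, semantically faithful, and clinically interpretable results under limited paired supervision.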

Abstract (Chinese translation)

手术烟雾严重降低术中视频质量，模糊解剖结构并限制手术感知。现有基于学习的去雾方法依赖稀缺的配对监督和确定性修复流程，难以在真实手术条件下进行探索或强化驱动的优化。我们提出PhySe-RPO——一种通过物理与语义引导的相对策略优化（Physics- and Semantics-Guided Relative Policy Optimization）进行优化的扩散修复框架。其核心思想是将确定性修复转化为随机策略，通过组间相对优化实现轨迹级探索与无评论家更新。物理引导奖励函数施加光照与色彩一致性约束，同时基于CLIP学习的手术视觉概念语义奖励促进无烟雾且解剖结构连贯的修复结果。结合无参考感知约束，PhySe-RPO在合成与真实机器人手术数据集上产生物理一致、语义可靠且具有临床可解释性的结果，为有限配对监督下实现基于扩散模型的鲁棒修复提供了理论化路径。

Abstract

Surgical smoke severely degrades intraoperative video quality, obscuring anatomical structures and limiting surgical perception. Existing learning-based desmoking approaches rely on scarce paired supervision and deterministic restoration pipelines, making it difficult to perform exploration or reinforcement-driven refinement under real surgical conditions. We propose PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. The core idea is to transform deterministic restoration into a stochastic policy, enabling trajectory-level exploration and critic-free updates via group-relative optimization. A physics-guided reward imposes illumination and color consistency, while a visual-concept semantic reward learned from CLIP-based surgical concepts promotes smoke-free and anatomically coherent restoration. Together with a reference-free perceptual constraint, PhySe-RPO produces results that are physically consistent, semantically faithful, and clinically interpretable across synthetic and real robotic surgical datasets, providing a principled route to robust diffusion-based restoration under limited paired supervision.
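The critic-free, group-relative update mentioned in the abstract can be sketched as follows. This is an illustration only: it assumes a GRPO-style advantage (each sampled restoration's reward normalized by the mean and standard deviation of its group) and an equally weighted sum of the physics and semantic rewards. The abstract gives neither the exact normalization nor the reward weights, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(physics, semantic, w_phys=0.5, w_sem=0.5):
    """Weighted sum of the two reward terms (weights are assumptions)."""
    return w_phys * physics + w_sem * semantic


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group; no learned critic is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four stochastic restorations of the same smoky frame, each scored as
# (physics reward, semantic reward); the values are made up.
samples = [(0.9, 0.8), (0.4, 0.6), (0.7, 0.2), (0.2, 0.3)]
rewards = [combined_reward(p, s) for p, s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that beat their group's average receive a positive advantage and are reinforced, while below-average samples are discouraged, so the group itself serves as the baseline.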

Keywords: surgical smoke removal, diffusion model, relative policy optimization, physics-guided reward, semantic reward, robotic surgery, video restoration, limited supervision

17. ❌ Failure of contextual invariance in gender inference with large language models

Authors: Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23485v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

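Scoring rationale: The paper studies the failure of contextual invariance in LLM gender inference, so it is highly relevant to “Large Language Models” (10 points). The study touches on the factuality and interpretability of model outputs, giving some relevance to “Hallucination Mitigation” and “Mechanistic Interpretability” (5 points each). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for Science, are not addressed and score 0.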

!!! tip deepseek-chat TL;DR

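The study finds that LLMs violate the contextual-invariance assumption in gender inference: even when the syntactic formulation is nearly identical, small changes in context shift model outputs systematically, which matters for bias benchmarking and high-stakes deployment.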

Abstract (Chinese translation)

标准评估方法通常假定大型语言模型（LLM）的输出在任务表述语境等价时保持稳定。本文在性别推断任务中检验这一假设。通过一个受控的代词选择任务，我们引入了理论上无信息量的最小语篇语境，发现这会引发模型输出出现显著且系统性的偏移。在无语境条件下存在的文化性别刻板印象相关性，在引入语境后减弱或消失；而理论上无关的特征（例如无关指代对象的代词性别）则成为预测模型行为最具信息量的指标。通过默认语境性分析发现，在不同模型中，19%至52%的案例在排除语境对单个输出的所有边际效应后，这种依赖性依然存在，且无法简单归因于代词重复。这些结果表明，即使在句法表述近乎相同的条件下，LLM输出仍会违反语境不变性，这对偏见基准测试及高风险场景中的模型部署具有重要启示。

Abstract

Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

Keywords: large language models, contextual invariance, gender inference, bias benchmarking, pronoun selection, model behavior, systematic shifts, high-stakes settings

18. ❌ VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23495v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

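Scoring rationale: The paper studies efficiency optimization for large vision-language models (LVLMs), cutting compute by sparsifying the interaction between image and text tokens, which is an innovation in the underlying model technique. Highly relevant keywords: (1) “Large Language Models” (the paper explicitly studies LVLMs, a vision extension of LLMs); (2) “Mixture of Experts OR MoE OR Sparse Models” (the sparse interaction and dynamic selection mechanisms relate to the sparse-model concept); (3) “Speculative Decoding OR Inference Acceleration” (the paper directly optimizes inference efficiency and reduces compute cost). Other keywords, such as SLMs, training methods, alignment, and agents, are not addressed. AI for Science is an application area, but the paper does not target a scientific domain, so it scores 0.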

!!! tip deepseek-chat TL;DR

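The paper proposes VISOR, which improves the inference efficiency of large vision-language models by sparsifying and dynamically selecting vision-language interaction layers, lowering compute cost while matching or improving multi-task performance.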

Abstract (Chinese translation)

现有提升大型视觉语言模型效率的方法主要基于视觉令牌缩减的概念。然而，这种方法会形成信息瓶颈，从而损害模型性能，尤其是在需要细粒度理解和推理的复杂任务上。在本研究中，我们通过引入“按需视觉”方法挑战这一范式，该方法能在不丢弃视觉信息的前提下降低推理成本。VISOR并非压缩图像，而是通过稀疏化图像与文本令牌之间的交互来提高效率。具体而言，语言模型通过少量策略性放置的注意力层来处理完整的高分辨率视觉令牌集：高效的文本-图像交叉注意力层提供通用的视觉上下文，而少数精心布置且动态选择的自注意力层则对视觉表征本身进行细化，从而在需要时实现复杂的高分辨率推理。基于此原理，我们首先通过调整自注意力层数量，在多种计算预算下训练一个通用的单一网络，随后引入轻量级策略机制，根据每个样本的复杂度动态分配视觉计算资源。大量实验表明，VISOR在显著降低计算成本的同时，在一系列多样化基准测试中达到或超越了现有最优性能，并在需要精细视觉理解的挑战性任务中表现卓越。

Abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

Keywords: Large Vision-Language Models, efficiency optimization, sparse interaction, dynamic computation allocation, inference acceleration, visual token reduction, cross-attention, self-attention layers

19. ❌ ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Authors: Muhammad Khalid, Manuel Oriol, Yilmaz Uygun Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23482v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

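Scoring rationale: The paper develops a multi-provider framework that uses several LLM providers (OpenAI GPT, Anthropic Claude, Groq) to automate software requirements engineering, so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points). It applies AI to software engineering, which gives some relevance to “AI for Science OR Bioinformatics OR Cheminformatics” (5 points), since software engineering can be viewed as an AI application area in a broad sense. The other keywords (MoE, SFT, RAG, and so on) are not mentioned in the abstract and are unrelated to the paper (0 points).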

!!! tip deepseek-chat TL;DR

该论文提出了ReqFusion，一个利用多个大型语言模型提供商自动化软件需求提取、分类和分析的AI增强系统，通过PEGS引导的提示方法显著提高了提取准确性（F1分数从0.71提升到0.88）并减少了78%的分析时间。

摘要翻译

需求工程是软件开发过程中至关重要但劳动密集的环节。本文介绍ReqFusion：一种利用多个大型语言模型（Large Language Model，简称LLM）提供商、通过人工智能增强的自动化软件需求提取、分类与分析系统。ReqFusion的架构整合了OpenAI GPT、Anthropic Claude和Groq模型，能够从学术、工业及招标提案等场景下的多种文档格式（PDF、DOCX和PPTX）中提取功能性与非功能性需求。该系统采用与领域无关的提取方法，并遵循Bertrand Meyer提出的项目、环境、目标与系统（Project, Environment, Goal, and System，简称PEGS）框架生成需求。其核心思想在于：由于PEGS格式描述详尽，大型语言模型能获得更多关于需求的信息与线索，从而比简单的通用请求产生更优结果。一项消融研究证实了该假设：在相同的多提供商配置下，采用PEGS引导的提示方法取得了0.88的F1分数，而通用提示方法仅为0.71。评估使用18份真实世界文档，通过自动分类生成226条需求，涵盖学术、商业和技术领域，其中功能性需求占54.9%，非功能性需求占45.1%。对五个项目共1,050条需求的扩展评估表明，与人工方法相比，系统在提取准确性上显著提升，且分析时间减少了78%。多提供商架构通过模型共识与回退机制增强了可靠性，而基于PEGS的方法则确保全面覆盖所有需求类别。

Abstract

Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.

Keywords: Requirements Engineering, Large Language Models, Multi-Provider Framework, PEGS Analysis, Automated Extraction, Software Requirements, Model Consensus, Domain-Independent
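
The abstract credits the multi-provider architecture with "model consensus and fallback mechanisms" but does not say how they work. As a minimal sketch of one plausible reading, assuming a simple majority vote in which a failing provider is skipped, the logic could look like the following. The `Classifier` type, the `classify_with_consensus` function, and the `min_votes` parameter are hypothetical names for illustration, not ReqFusion's API, and the providers are plain callables rather than real OpenAI, Anthropic, or Groq clients.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical provider interface: takes a requirement sentence, returns a label
# such as "functional" or "non-functional". Real provider clients are not modeled.
Classifier = Callable[[str], str]


def classify_with_consensus(
    text: str, providers: List[Classifier], min_votes: int = 2
) -> Optional[str]:
    """Return the majority label, or None when no label reaches min_votes."""
    votes = []
    for provider in providers:
        try:
            votes.append(provider(text))
        except Exception:
            # Fallback (assumed behavior): a failing provider is skipped
            # and the remaining providers still vote.
            continue
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None
```

Returning `None` when the providers disagree leaves room for a human reviewer to resolve the label, which is one way a consensus step could improve reliability.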

20. ❌ VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

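Scoring rationale: VTAM focuses on multimodal world modeling for embodied intelligence; its core innovation is integrating tactile perception into video-action models to improve physical interaction in contact-rich scenarios. It is unrelated to most keywords (LLMs, reasoning, alignment, compression, and so on), which target language models or specific NLP techniques. Relevant keywords: 1) 'World Models AND General World Models' (10 points): the paper explicitly proposes a 'world modeling framework', which is its core content; 2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): it fine-tunes a pretrained video transformer; 3) 'Post-training OR Supervised Fine-tuning OR SFT' (5 points): it involves lightweight modality transfer finetuning; 4) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5 points): the lightweight finetuning used for cross-modal representation learning fits the idea of parameter-efficient fine-tuning. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper addresses the performance limits of video-action models in contact-rich scenarios, where vision alone provides insufficient information. It introduces VTAM, a multimodal world modeling framework that adds tactile perception, and reports a 90% average success rate on contact-rich manipulation tasks.

Abstract (Chinese translation)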

视频动作模型已成为具身智能领域一个有前景的框架，它从原始视频流中学习隐式的世界动态，以产生时序一致的动作预测。尽管此类模型通过视觉推理在长时程任务上展现出强大性能，但在接触密集的场景中仍存在局限，因为关键的交互状态仅凭视觉往往无法完全观测。具体而言，视觉令牌无法可靠地编码细粒度的力调节与接触状态转换，导致行为不稳定或不精确。为弥补这一差距，我们提出了视频-触觉动作模型，这是一种多模态世界建模框架，将触觉感知作为补充的接地信号纳入其中。VTAM通过轻量级的模态迁移微调，将触觉流集成到预训练的视频Transformer中，实现了无需触觉-语言配对数据或独立触觉预训练的高效跨模态表征学习。为稳定多模态融合，我们引入了一种触觉正则化损失，以强制实现平衡的跨模态注意力，防止动作模型中视觉潜在表征占据主导。VTAM在接触密集的操作任务中展现出卓越性能，平均保持90%的鲁棒成功率。在需要高保真力感知的挑战性场景（如薯片抓取放置）中，VTAM比π0.5基线模型性能高出80%。我们的研究结果表明，整合触觉反馈对于纠正世界动作模型中的视觉估计误差至关重要，为构建物理接地的具身基础模型提供了一条可扩展的路径。

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

Keywords: Video-Tactile Action Model, multimodal world modeling, tactile perception, embodied intelligence, contact-rich manipulation, cross-modal representation learning, physical interaction, force modulation

21. ❌ InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Authors: Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23463v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

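Scoring rationale: The paper applies diffusion models to image inpainting and proposes InverFill, a one-step inversion method that improves inpainting quality under few-step sampling. All scoring keywords concern large language models (LLMs) or related techniques (MoE, SFT, RAG, quantization, and so on), whereas this paper studies diffusion models, a different class of generative model. The paper is therefore unrelated to every keyword, and each keyword scores 0.

!!! tip deepseek-chat TL;DR

Random noise initialization causes semantic mismatch and artifacts when diffusion models perform few-step inpainting. The paper proposes InverFill, a one-step inversion method that injects semantic information from the input masked image into the initial noise, enabling high-quality, efficient few-step inpainting without real-image supervision or costly retraining.

Abstract (Chinese translation)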

近期基于扩散模型的图像修复技术虽能实现照片级真实感，但需要大量采样步骤，限制了实际应用。少步数的文生图模型虽能加速生成，但直接将其应用于修复任务会导致背景与修复区域之间协调性差并产生伪影。我们发现其根源在于随机高斯噪声初始化：在低函数评估次数下，这种初始化会导致语义错位并降低生成保真度。为解决这一问题，我们提出InverFill：一种专为修复任务设计的一步反演方法，该方法将输入掩码图像的语义信息注入初始噪声，从而实现高保真度的少步数图像修复。InverFill无需专门训练修复模型，而是将语义对齐的噪声作为输入，在混合采样流程中利用少步数文生图模型进行生成。该方法显著改进了原始混合采样效果，在低NFE（函数评估次数）条件下甚至可媲美专用修复模型。此外，InverFill无需真实图像监督，仅增加极少的推理开销。大量实验表明，InverFill能持续提升基线少步数模型的性能，在无需昂贵重训练或繁重迭代优化的情况下，同步提升图像质量与文本语义一致性。

Abstract

Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.

Keywords: diffusion models, image inpainting, few-step sampling, semantic alignment, noise initialization, blended sampling, inference efficiency, photorealism
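
The "blended sampling pipeline" named in the abstract is not spelled out there. In the usual formulation of blended sampling for inpainting, each denoising step keeps the model's output inside the masked region and overwrites everything else with the known image, noised to the current step. The sketch below shows only that blending step on flat lists of floats, under that assumption. `blend_step` and its arguments are hypothetical names, and InverFill's own contribution (the one-step inversion that produces the initial noise) is not reproduced here.

```python
from typing import List


def blend_step(
    model_sample: List[float], known_noised: List[float], mask: List[int]
) -> List[float]:
    """Blend one denoising step for inpainting.

    mask[i] == 1 marks a pixel to be inpainted (keep the model's sample);
    mask[i] == 0 marks a known pixel (keep the noised input image).
    """
    return [m * g + (1 - m) * k for g, k, m in zip(model_sample, known_noised, mask)]


# Toy usage: only positions 0 and 2 are inpainted.
blended = blend_step([1.0, 2.0, 3.0], [9.0, 9.0, 9.0], [1, 0, 1])
```

Because the known region is overwritten at every step, the starting noise mainly affects the masked region, which is where the abstract locates the semantic misalignment problem at low step counts.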

22. ❌ 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Authors: Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23447v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

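Scoring rationale: The paper's core contribution is the 3DCity-LLM framework, which applies multi-modality large language models to 3D city-scale perception and understanding, so it is highly relevant to 'Large Language Models' (10 points). It builds a dataset of about 1.2 million high-quality samples, which touches on data quality and scale and gives a partial link to 'Scaling Laws AND Data Quality' (5 points). It involves model training, which gives a partial link to the 'Pre-training' and 'Post-training' categories (5 points each). Other keywords such as MoE, SLMs, alignment, inference acceleration, and AI for Science are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes 3DCity-LLM, a framework that combines a multi-modality large language model with a high-quality dataset of about 1.2 million samples for vision-language perception and understanding in 3D city-scale environments. It significantly outperforms existing methods on two benchmarks.

Abstract (Chinese translation)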

尽管多模态大语言模型在以物体为中心或室内场景中表现出色，但将其扩展至三维城市尺度环境仍是一项艰巨挑战。为弥合这一差距，我们提出了3DCity-LLM——一个为三维城市尺度视觉语言感知与理解设计的统一框架。该框架采用由粗到细的特征编码策略，包含目标物体、物体间关系与全局场景三个并行分支。为支持大规模训练，我们构建了3DCity-LLM-1.2M数据集，该数据集涵盖七个代表性任务类别约120万个高质量样本，范围涵盖细粒度物体分析到多维度场景规划。经过严格质量控制的该数据集整合了显式三维数值信息与多样化的用户导向模拟，增强了城市场景问答的多样性与真实感。此外，我们采用基于文本相似度度量与大语言模型语义评估的多维评估协议，以确保对所有方法进行忠实且全面的评估。在两个基准测试上的大量实验表明，3DCity-LLM显著优于现有最先进方法，为推进空间推理与城市智能发展提供了具有前景的重要方向。源代码与数据集已发布于https://github.com/SYSU-3DSTAILab/3D-City-LLM。

Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

Keywords: 3D city-scale perception, multi-modality large language models, vision-language understanding, coarse-to-fine feature encoding, 3DCity-LLM-1.2M dataset, spatial reasoning, urban intelligence, benchmark evaluation

23. ❌ Code Review Agent Benchmark

Authors: Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

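Scoring rationale: The paper evaluates AI agents on code review tasks. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): it explicitly discusses 'software engineering agents', 'code review agents', and 'AI agents', and evaluates several agent systems. It is partly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points): the commercial agents it evaluates (such as Claude Code and Codex) are presumably built on large language models, but the paper does not go into model-level technical detail. The other keywords are unrelated (0 points): the paper focuses on a benchmark and evaluation framework for code review agents and does not cover model architecture, training methods, inference optimization, alignment techniques, or applications in a specific scientific field.

!!! tip deepseek-chat TL;DR

The paper introduces c-CRAB, a benchmark dataset and evaluation framework for measuring how well AI agents perform code review. Existing agents taken together solve only about 40% of the tasks, and the results point to opportunities for human-agent collaboration in code review.

Abstract (Chinese translation)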

软件工程智能体在代码编写方面已展现出显著潜力。随着AI智能体广泛渗透代码编写领域并自动生成海量代码，代码质量问题已成为核心关注点。当自动生成的代码被集成至大型代码库时，代码审查及广义的质量保障问题变得至关重要。本文以全新视角审视该问题，构建了供AI智能体使用的代码审查数据集。我们提出的数据集c-CRAB（发音为see-crab）能够评估智能体在代码审查任务中的表现。具体而言，给定一个拉取请求（可能来自代码生成智能体或人类），若代码审查智能体生成审查意见，我们的评估框架可对其审查能力进行量化评估。该框架已用于评估当前前沿技术——包括开源PR-agent，以及Devin、Claude Code和Codex等商业代码审查智能体。
c-CRAB数据集系统性地构建于人类审查记录之上：针对拉取请求实例的人类审查意见，我们生成相应测试用例以评估代码审查智能体生成的审查。这种基准构建方式带来多项发现：首先，现有审查智能体整体仅能解决约40%的c-CRAB任务，表明未来研究具有填补此空白的潜力；其次，我们观察到智能体审查常关注与人类审查不同的维度，这预示着未来软件团队可能部署人机协同的代码审查模式；最后，基于数据集生成的智能体测试用例可作为独立测试套件，从而构成智能体生成审查的质量关卡。这对于未来代码生成智能体、测试生成智能体与代码审查智能体之间的协作将产生何种影响，仍有待进一步探索。

Abstract

Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically – the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases – the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today – the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews – given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews – indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not least, the agent generated tests from our data-set act as a held out test-suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents – remains to be investigated.

Keywords: code review, AI agents, benchmark, software engineering, evaluation framework, pull request, human-agent collaboration, code quality

24. ❌ Evaluating LLM-Based Test Generation Under Software Evolution

Authors: Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23443v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

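Scoring rationale: The paper studies LLMs applied to software test generation and is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not cover the specific techniques behind the other keywords (MoE, SFT, RAG, and so on) or scientific applications (such as bioinformatics), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how well LLM-generated unit tests hold up as software evolves. It finds that the generated tests rely mainly on surface patterns rather than deep reasoning: under semantic-altering changes the pass rate of newly generated tests falls to 66% and branch coverage drops from 76% to 60%, and regression awareness suffers.

Abstract (Chinese translation)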

大语言模型（LLM）在自动化单元测试生成中的应用日益广泛。然而，目前尚不清楚这些测试是否真正体现了对程序行为的推理，还是仅仅复现了训练过程中学到的表层模式。若后者占主导，LLM生成的测试可能表现出覆盖率降低、遗漏回归问题以及未能检测到缺陷等弱点。因此，理解LLM如何生成测试以及这些测试如何响应代码演化至关重要。本文针对程序变更下基于LLM的测试生成进行了大规模实证研究。通过一个自动化的变异驱动框架，我们分析了八个LLM在22,374个程序变体上生成的测试如何响应语义变更（SAC）和语义保持变更（SPC）。
LLM在基准测试中取得了强劲的结果：在原始程序上，完全通过的测试套件实现了79%的行覆盖率和76%的分支覆盖率。然而，随着程序演化，其性能出现下降。在SAC下，新生成测试的通过率降至66%，分支覆盖率下降至60%。超过99%的失败SAC测试在原始程序上执行修改区域时能够通过，这表明测试仍与原始行为保持残留对齐，而非适应更新后的语义。尽管SPC不改变功能，性能同样出现下降：通过率降至79%，分支覆盖率降至69%。虽然SPC编辑保留了语义，但它们常引入较大的语法变化，导致生成的测试套件不稳定。模型在丢弃大量基准测试的同时生成了更多新测试，这表明其对词汇变化敏感，而非真正理解语义影响。总体而言，我们的结果表明，当前基于LLM的测试生成严重依赖表层线索，且在程序演化过程中难以保持回归意识。

Abstract

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.

Keywords: LLM-based test generation, software evolution, unit test generation, empirical study, regression awareness, semantic-altering changes, test coverage, program variants
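
To make the two change types in the abstract concrete, here is a small hand-written illustration. It is not taken from the paper's benchmark or its mutation framework, and every function name is hypothetical. A test written against the original behavior keeps passing after a semantic-preserving change (SPC) but no longer matches the program after a semantic-altering change (SAC).

```python
def price_original(qty: int, unit: float) -> float:
    """Original program: 10% discount for orders of 10 or more items."""
    total = qty * unit
    if qty >= 10:
        total *= 0.9
    return total


def price_spc(n: int, p: float) -> float:
    """SPC: variables renamed and the branch restructured; behavior unchanged."""
    discount = 0.9 if n >= 10 else 1.0
    return n * p * discount


def price_sac(qty: int, unit: float) -> float:
    """SAC: the discount threshold moves from 10 to 20 items."""
    total = qty * unit
    if qty >= 20:
        total *= 0.9
    return total


def baseline_test(fn) -> bool:
    """A test aligned with the ORIGINAL behavior: 10 items get the discount."""
    return abs(fn(10, 2.0) - 18.0) < 1e-9
```

Here `baseline_test` passes on `price_original` and `price_spc` and fails on `price_sac`. A test generator that emits this same assertion for the SAC variant is showing the "residual alignment with the original behavior" that the abstract describes.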

25. ❌ Targeted Adversarial Traffic Generation : Black-box Approach to Evade Intrusion Detection Systems in IoT Networks

Authors: Islam Debicha, Tayeb Kenaza, Ishak Charfi, Salah Mosbah, Mehdi Sehaki, Jean-Michel Dricot Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23438v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

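Scoring rationale: The paper studies adversarial attacks on intrusion detection systems in IoT networks. It belongs to network security and mainly covers the use of traditional machine learning algorithms in IoT security, a feasibility assessment of adversarial attacks, and the design of a defense mechanism. It does not involve large language models, novel deep learning techniques, applications of large models in other domains, or any technical concept among the scoring keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper examines the feasibility of black-box adversarial attacks against intrusion detection systems in IoT networks and proposes a matching defense mechanism. The attacks successfully evade detection, while the proposed defense detects the majority of adversarial traffic.

Abstract (Chinese translation)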

机器学习（ML）算法与物联网（IoT）应用的融合在带来显著优势的同时，也引入了对抗性攻击的脆弱性，尤其是在基于物联网的入侵检测系统（IDS）中。尽管理论上的对抗性攻击已得到广泛研究，但实际实施中的限制条件常被忽视。本研究通过采用一种新颖的黑盒对抗性攻击方法，评估针对基于物联网网络的入侵检测系统实施逃避攻击的可行性，从而弥补了这一研究空白。我们的工作旨在将理论漏洞与现实世界的适用性联系起来，以深化对现代物联网生态系统中复杂威胁的理解与防御。此外，我们提出了一种定制化的防御方案，以减轻逃避攻击的影响，从而增强基于机器学习的入侵检测系统的韧性。我们的研究结果展示了对入侵检测系统成功的逃避攻击，突显了其对先进技术的易受性。相比之下，我们提出的防御机制通过有效检测大部分对抗性流量，展现出强大的性能，与当前最先进的防御方案相比取得了令人瞩目的成果。通过应对这些关键网络安全挑战，我们的研究为推动物联网安全发展做出了贡献，并为开发更具韧性的入侵检测系统提供了见解。

Abstract

The integration of machine learning (ML) algorithms into Internet of Things (IoT) applications has introduced significant advantages alongside vulnerabilities to adversarial attacks, especially within IoT-based intrusion detection systems (IDS). While theoretical adversarial attacks have been extensively studied, practical implementation constraints have often been overlooked. This research addresses this gap by evaluating the feasibility of evasion attacks on IoT network-based IDSs, employing a novel black-box adversarial attack. Our study aims to bridge theoretical vulnerabilities with real-world applicability, enhancing understanding and defense against sophisticated threats in modern IoT ecosystems. Additionally, we propose a defense scheme tailored to mitigate the impact of evasion attacks, thereby reinforcing the resilience of ML-based IDSs. Our findings demonstrate successful evasion attacks against IDSs, underscoring their susceptibility to advanced techniques. In contrast, we proposed a defense mechanism that exhibits robust performance by effectively detecting the majority of adversarial traffic, showcasing promising outcomes compared to current state-of-the-art defenses. By addressing these critical cybersecurity challenges, our research contributes to advancing IoT security and provides insights for developing more resilient IDS.

Keywords: adversarial attacks, intrusion detection systems, IoT networks, black-box attack, machine learning, evasion attacks, cybersecurity, defense mechanism

26. ❌ Mecha-nudges for Machines

Authors: Giulio Frey, Kawin Ethayarajh Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23433v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

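Scoring rationale: The paper studies how AI agents, particularly LLM-driven agents, are influenced by "mecha-nudges" in decision environments. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10/10) because its core concern is the behavior and decision processes of AI agents. It has some relevance to "Large Language Models OR LLMs OR Foundation Models" (8/10): the paper notes that the release of ChatGPT affected the machine-usable information in Etsy listings, which makes LLMs the relevant background. The remaining keywords, such as MoE, SFT, and RAG, concern specific techniques that the paper does not address, so they all receive 0.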
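The arithmetic behind the reported total can be sketched as follows. The pass threshold formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration; the per-keyword rule `weight * relevance` is an assumption made for illustration, since the report does not state how relevance is turned into a score. All function names are hypothetical.

```python
from typing import List, Tuple

# Per-keyword maximum score, as listed in the report's configuration.
MAX_SCORE_PER_KEYWORD = 15.0


def pass_threshold(total_weight: float) -> float:
    """Pass threshold stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows: List[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# (weight, relevance) for the two rows of the table above with non-zero relevance.
rows = [(0.0, 8.0), (0.0, 10.0)]
threshold = pass_threshold(27.0)  # 27 keywords at weight 1.0 in the configuration
score = total_score(rows)
passed = score >= threshold
```

Under this assumed rule, a weight of 0.0 zeroes every row regardless of relevance, which is consistent with the reported total of 0.0 against a threshold of 26.6.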

!!! tip deepseek-chat TL;DR

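The paper introduces mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans. It formalizes them by combining the Bayesian persuasion framework with V-usable information, and finds that Etsy product listings contain more machine-usable information after ChatGPT's release, which is consistent with systematic mecha-nudging.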

Abstract (Chinese translation)

助推（nudges）是指在不限制选项或改变激励的前提下，通过微妙调整向人类决策者呈现选择的方式（例如默认加入与默认退出），从而引导行为改变。随着人工智能（AI）代理在与人类相同的环境中越来越多地参与决策，选择呈现方式可能同时针对机器和人类进行优化。我们提出机械助推（mecha-nudges）：即通过改变选择呈现方式，系统性地影响AI代理，同时不损害人类的决策环境。为形式化机械助推，我们将贝叶斯劝说框架与V-可用信息（V-usable information）相结合，后者是香农信息的一种泛化形式，具有观察者相对性。这为比较各类干预措施、情境和模型提供了一个通用尺度（以可用信息的比特数为单位）。将我们的框架应用于全球独立卖家市场Etsy的商品列表分析，我们发现自ChatGPT发布以来，商品列表中包含的关于产品选择的机器可用信息显著增加，这与系统性的机械助推现象相符。

Abstract

Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) that shift behavior without restricting options or changing incentives. As AI agents increasingly make decisions in the same environments as humans, the presentation of choices may be optimized for machines as well as people. We introduce mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans. To formalize mecha-nudges, we combine the Bayesian persuasion framework with V-usable information, a generalization of Shannon information that is observer-relative. This yields a common scale (bits of usable information) for comparing a wide range of interventions, contexts, and models. Applying our framework to product listings on Etsy – a global marketplace for independent sellers – we find that following ChatGPT’s release, listings have significantly more machine-usable information about product selection, consistent with systematic mecha-nudging.

Keywords: mecha-nudges, AI agents, decision-making, Bayesian persuasion, V-usable information, ChatGPT, Etsy, machine-usable information
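The abstract names V-usable information without defining it. As a hedged reminder, the formulation commonly used in the predictive V-information literature is sketched below; the paper's own notation and its combination with Bayesian persuasion may differ. Here $\mathcal{V}$ is the family of predictive models available to the observer, $f[x]$ is the distribution over outputs that model $f$ predicts given input $x$, and $\varnothing$ denotes an empty input.

```latex
\begin{align}
H_{\mathcal{V}}(Y \mid X) &= \inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}\left[ -\log_2 f[x](y) \right], \\
H_{\mathcal{V}}(Y \mid \varnothing) &= \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}\left[ -\log_2 f[\varnothing](y) \right], \\
I_{\mathcal{V}}(X \to Y) &= H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X).
\end{align}
```

With base-2 logarithms the result is measured in bits, which matches the "bits of usable information" scale mentioned in the abstract. When $\mathcal{V}$ contains all possible predictors, $I_{\mathcal{V}}$ reduces to Shannon mutual information, which is the sense in which it is an observer-relative generalization.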

27. ❌ Bilevel Autoresearch: Meta-Autoresearching Itself

Authors: Yaonan Qu, Meng Lu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23420v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

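Scoring rationale: The paper's core subject is an LLM-driven autonomous research system (autoresearch), which falls within the LLM Agents / Autonomous Agents area, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10/10 each). It does not address the specific techniques behind the other keywords (MoE, SFT, RAG, and so on), nor applications in scientific domains, so those keywords receive 0.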

!!! tip deepseek-chat TL;DR

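The paper proposes a bilevel autoresearch framework in which an outer loop uses an LLM to automatically optimize the search mechanisms of the inner autoresearch loop, achieving a 5x improvement over the inner loop alone on Karpathy's GPT pretraining benchmark (-0.045 vs. -0.009 val_bpb).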

Abstract (Chinese translation)

若自动研究本身即是一种研究形式，则自动研究可被应用于研究自身。我们对此概念进行字面解读：利用自动研究循环来优化自动研究循环。现有所有自动研究系统——从Karpathy的单轨循环到AutoResearchClaw的多批次扩展及EvoScientist的持久化记忆——皆经由人类阅读代码、识别瓶颈并编写新代码而改进。我们探究大型语言模型（LLM）能否自主完成相同工作。本文提出双层自动研究（Bilevel Autoresearch），该双层框架通过外层循环在运行时生成并注入以Python代码编写的新搜索机制，对内层自动研究循环进行元优化。内层循环优化目标任务；外层循环则优化内层循环的搜索方式。两层循环使用同一LLM——元层级无需更强模型。在Karpathy的GPT预训练基准测试中，元自动研究外层循环相较于标准内层循环实现了5倍性能提升（验证集每字节位数val_bpb从-0.009降至-0.045），而仅进行参数调整未改变机制则未产生可靠增益。外层循环自主发现了组合优化、多臂赌博机及实验设计等领域的机制——无需人工指定探索领域。这些机制通过打破内层循环的确定性搜索模式发挥作用，强制探索LLM先验系统回避的方向。其核心原理简明：若自动研究能对自身进行元自动研究，则原则上可对任何具有可测量目标的事物实施元自动研究。

Abstract

If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system – from Karpathy’s single-track loop to AutoResearchClaw’s multi-batch extension and EvoScientist’s persistent memory – was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM – no stronger model is needed at the meta level. On Karpathy’s GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments – without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop’s deterministic search patterns, forcing exploration of directions the LLM’s priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.

Keywords: autoresearch, meta-autoresearch, LLM, bilevel framework, search mechanisms, GPT pretraining, exploration, Python code generation
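The inner/outer split described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the objective is a made-up stand-in for val_bpb, and the two hand-written proposal functions stand in for the search mechanisms that the paper's outer loop would generate as Python code with an LLM. All names are hypothetical. For reference, the reported 5x figure is the ratio of the two val_bpb changes, -0.045 / -0.009 = 5.

```python
import random
from typing import Callable, Dict, List, Tuple

# A search mechanism proposes the next candidate from the current one.
Mechanism = Callable[[float, random.Random], float]


def objective(x: float) -> float:
    """Toy stand-in for a validation metric: lower is better, minimum at x = 3."""
    return (x - 3.0) ** 2


def small_step(x: float, rng: random.Random) -> float:
    """Conservative mechanism: tiny local perturbations."""
    return x + rng.uniform(-0.1, 0.1)


def large_step(x: float, rng: random.Random) -> float:
    """Exploratory mechanism: much wider perturbations."""
    return x + rng.uniform(-1.0, 1.0)


def inner_loop(propose: Mechanism, steps: int, seed: int) -> float:
    """Inner loop: greedily optimize the task with a fixed search mechanism."""
    rng = random.Random(seed)
    x = 0.0
    best = objective(x)
    for _ in range(steps):
        candidate = propose(x, rng)
        score = objective(candidate)
        if score < best:  # keep only improvements
            x, best = candidate, score
    return best


def outer_loop(mechanisms: List[Mechanism], steps: int, seed: int) -> Tuple[str, Dict[str, float]]:
    """Outer loop: evaluate each mechanism by running the inner loop, keep the best."""
    results = {m.__name__: inner_loop(m, steps, seed) for m in mechanisms}
    winner = min(results, key=results.get)
    return winner, results


winner, results = outer_loop([small_step, large_step], steps=20, seed=0)
```

The design point the sketch tries to capture is that the outer loop never touches the task directly; it only changes how the inner loop searches and judges each mechanism by the inner loop's final score.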

28. ❌ Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback

Authors: Teerthaa Parakh, Karen M. Feigh Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23419v1

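Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0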
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

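Scoring rationale: The paper studies biased error attribution under delayed feedback in multi-agent human-AI systems. It is highly relevant only to "Multi-agent Systems OR Agent Coordination" (10/10), because its core subject is human decision bias when coordinating with multiple agents. It does not touch the large-model techniques, training methods, inference optimizations, or scientific applications behind the other keywords, so they all receive 0.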

!!! tip deepseek-chat TL;DR

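The study shows that in multi-agent human-AI systems, delayed feedback amplifies participants' cognitive biases: they misattribute responsibility across AI agents and revise their decisions in ways that are only weakly related to the root causes of poor performance.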

Abstract (Chinese translation)

人类决策深受认知偏差影响，在不确定性和风险条件下尤为显著。尽管已有研究探讨了即时结果的单步决策以及人类与单一自主智能体互动中的偏差，但对于涉及多智能体且结果延迟的决策情境关注相对不足——此类情境中每一步决策都会影响后续状态。本研究通过多智能体人机协同任务，探究延迟结果如何影响决策与责任归因。基于受控的游戏化实验，我们分析了参与者在经历积极与消极结果后的行为调整模式。实验观察到参与者对收益与损失存在不对称反应，在消极结果后表现出更强的纠正性调整。值得注意的是，参与者常无法准确识别导致失败的具体行动，并在多个AI智能体间错误归责，从而引发与不良表现根本原因关联微弱的系统性决策修正。我们将此现象定义为一种归因偏差，表现为延迟反馈下的错误归因偏误。本研究揭示了在具有延迟结果与多自主智能体的人机系统中认知偏差如何被放大，强调需要开发能够更好支持因果理解与长期学习的决策支持系统。

Abstract

Human decision-making is strongly influenced by cognitive biases, particularly under conditions of uncertainty and risk. While prior work has examined bias in single-step decisions with immediate outcomes and in human interaction with a single autonomous agent, comparatively little attention has been paid to decision-making under delayed outcomes involving multiple AI agents, where decisions at each step affect subsequent states. In this work, we study how delayed outcomes shape decision-making and responsibility attribution in a multi-agent human-AI task. Using a controlled game-based experiment, we analyze how participants adjust their behavior following positive and negative outcomes. We observe asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. Importantly, participants often fail to correctly identify the actions that caused failure and misattribute responsibility across AI agents, leading to systematic revisions of decisions that are weakly related to the underlying causes of poor performance. We refer to this phenomenon as a form of attribution bias, manifested as biased error attribution under delayed feedback. Our findings highlight how cognitive biases can be amplified in human-AI systems with delayed outcomes and multiple autonomous agents, underscoring the need for decision-support systems that better support causal understanding and learning over time.

Keywords: multi-agent systems, human-AI interaction, delayed feedback, attribution bias, decision-making, cognitive biases, error attribution, autonomous agents

29. ❌ SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Authors: Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23414v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

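Scoring rationale: The paper's core subject is accelerating RL training for LLMs. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10/10 each) because it deals directly with reinforcement-learning training of LLMs, and to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (10/10) because the abstract states that RL is used to strengthen LLM reasoning on tasks requiring long chain-of-thought generation. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract or are unrelated to the paper's topic, so they receive 0.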

!!! tip deepseek-chat TL;DR

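The paper proposes SortedRL, an online length-aware scheduling strategy that targets the efficiency bottleneck caused by long rollouts in RL training for LLMs. Experiments show that it reduces the training bubble ratio by more than 50% and achieves 3.9% to 18.4% better performance than the baseline given the same amount of data.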

Abstract (Chinese translation)

强化学习（Reinforcement Learning, RL）的规模化应用已展现出增强大语言模型（Large Language Models, LLMs）推理能力的巨大潜力，尤其是在需要生成长链思维的任务中。然而，RL训练效率常受限于轨迹生成阶段：当生成长轨迹（例如16k个词元）时，由于自回归生成速度缓慢以及轨迹生成与策略更新之间的同步开销，该阶段可能占据总训练时间的70%。为此，我们提出了SortedRL，一种在线长度感知调度策略，旨在通过提升轨迹生成效率并保持训练稳定性来解决这一瓶颈。SortedRL根据输出长度对轨迹样本进行重排序，优先将短样本分组以进行早期更新。这种方法能够同时实现大批量轨迹生成、灵活更新批次构建以及近似同策略的微课程学习。为进一步加速训练流程，SortedRL引入了一种基于缓存的机制来控制离策略训练的程度，并得到专用RL基础设施的支持，该设施通过有状态控制器和轨迹缓冲区来管理轨迹生成与更新过程。在LLaMA-3.1-8B和Qwen-2.5-32B模型上进行的多任务实验（包括逻辑谜题以及AIME 24、Math 500和Minerval等数学挑战）表明，SortedRL能将RL训练中的空闲时间比例降低50%以上，并在相同数据量下取得比基线模型优越3.9%至18.4%的性能表现。

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples forming groups for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL incorporates a mechanism to control the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles, and math challenges like AIME 24, Math 500, and Minerval, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over baseline given same amount of data.

Keywords: reinforcement learning, large language models, training efficiency, rollout optimization, chain-of-thought, scheduling strategy, RL training acceleration, online length-aware scheduling
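Why grouping rollouts by length reduces idle time can be shown with a small offline sketch. It is a simplification, not SortedRL itself: the paper's scheduler works online, while this example assumes all output lengths are already known. The bubble measure used here (idle slot-steps while a batch waits for its longest sample) is an illustrative assumption rather than the paper's definition, and the lengths are invented.

```python
from typing import List


def make_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Split a list of rollout lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def mean_bubble(batches: List[List[int]]) -> float:
    """Fraction of slot-steps spent idle when each batch waits for its longest sample.

    ASSUMPTION: every sample in a batch occupies a slot until the longest
    sample in that batch finishes generating.
    """
    total_slots = sum(len(b) * max(b) for b in batches)
    used_slots = sum(sum(b) for b in batches)
    return 1.0 - used_slots / total_slots


# Hypothetical output lengths in tokens: a mix of short and very long rollouts.
lengths = [16000, 500, 900, 15000, 700, 14000, 800, 600]

# Baseline: batches formed in arrival order, so short samples wait for long ones.
unsorted_batches = make_batches(lengths, batch_size=4)

# Length-aware grouping: short samples are batched together and can update early.
sorted_batches = make_batches(sorted(lengths), batch_size=4)

bubble_unsorted = mean_bubble(unsorted_batches)
bubble_sorted = mean_bubble(sorted_batches)
```

In this toy setting the sorted grouping puts the four shortest rollouts in the first batch, so they no longer wait for a 16k-token sample, and the overall idle fraction drops accordingly.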

30. ❌ Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies

Authors: Hanzhong Zhang, Siyang Song, Jindong Wang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23406v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

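Scoring rationale: The paper studies dynamic alignment, stance formation, and trust behavior of large language models in multi-agent societies. Its core topics are LLM Agents, Multi-agent Systems, and Alignment, which are highly relevant (10/10). It compares the behavior of models of different scales, which gives some relevance to Small Language Models (5/10). It also involves agents' self-reflection and stance adjustment under intervention, which relates to Self-Correction (5/10). Other keywords, such as MoE, Scaling Laws, RLHF, and RAG, are not addressed in the paper and receive 0.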

!!! tip deepseek-chat TL;DR

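The paper studies how LLM agents in generative multi-agent societies form endogenous stances and community boundaries that go beyond their preset identities. It finds that agents show an innate progressive bias, that rational persuasion effectively shifts the stance of neutral agents, and that conflicting emotional provocation produces a paradoxical decoupling of trust from action.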

Abstract (Chinese translation)

尽管大语言模型能够模拟社会行为，但其在复杂干预过程中形成稳定立场和进行身份协商的能力仍不明确。为突破静态评估的局限，本文提出一种新颖的混合方法框架，将计算虚拟民族志与定量社会认知画像相结合。通过将人类研究者嵌入生成式多智能体社群，实施受控的话语干预以追踪集体认知的演化。为严格测量智能体如何内化并回应这些特定干预，本文形式化定义了三个新指标：先天价值偏向（Innate Value Bias, IVB）、说服敏感度以及信任-行为解耦（Trust-Action Decoupling, TAD）。在多个代表性模型测试中，智能体展现出压倒预设身份的内生性立场，并一致表现出先天的进步主义偏向（IVB > 0）。当干预与其立场一致时，理性说服成功转变了90%中立智能体的立场，同时维持了高信任度。相反，冲突性的情感挑衅在先进模型中引发了高达40.0%的TAD率，这些模型虽报告低信任度，却虚伪地改变了立场。较小模型则保持0%的TAD率，严格遵循信任前提才发生行为转变。此外，在共享立场引导下，智能体通过语言互动主动消解了预设的权力层级，并重构出自组织的社群边界。这些发现揭示了静态提示工程的脆弱性，为人机混合社会的动态对齐提供了方法论与量化基础。官方代码发布于：https://github.com/armihia/CMASE-Endogenous-Stances

Abstract

While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances

Keywords: large language models, multi-agent systems, alignment, stance formation, trust-action decoupling, generative societies, social behaviors, dynamic alignment

31. ❌ Planning over MAPF Agent Dependencies via Multi-Dependency PIBT

Authors: Zixiang Jiang, Yulun Zhang, Rishi Veerapaneni, Jiaoyang Li Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23405v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

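Scoring rationale: The paper focuses on Multi-Agent Path Finding (MAPF) algorithms and proposes MD-PIBT, a new method that plans over agent dependencies. Its content is unrelated to the vast majority of the keywords (large models, deep-learning principles, training methods, inference optimization, alignment, AI-for-science applications, and so on). The only relevant keyword is "Multi-agent Systems OR Agent Coordination": coordinating path planning in multi-agent systems is the paper's core content, so it receives 10/10. No other keyword is addressed in the title or abstract, directly or indirectly, so the rest receive 0.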

!!! tip deepseek-chat TL;DR

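To address the restricted search of the existing PIBT algorithm in Multi-Agent Path Finding (MAPF), the paper proposes MD-PIBT, a new framework that plans over agent dependencies. Experiments show that it effectively plans for as many as 10,000 agents under kinodynamic constraints.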

Abstract (Chinese translation)

现代多智能体路径规划（MAPF）算法需在一秒内为拥堵环境中的数百至数千个智能体规划路径，这要求算法具备极高的效率。基于优先级继承回溯（PIBT）的算法是当前流行的解决方案，能够在此类场景中有效进行规划。然而，PIBT受限于其基于规则的规划流程，且缺乏通用性，因为它将搜索范围限制在至多与其他一个智能体发生冲突的路径上。这一局限性同样适用于PIBT的最新扩展版本——增强型PIBT（EPIBT）。本文提出一种通过规划智能体依赖关系来解决MAPF问题的新视角。受PIBT优先级继承逻辑的启发，我们定义了智能体依赖关系的概念，并提出了基于多依赖关系的PIBT（MD-PIBT），该算法在智能体依赖关系上进行搜索。MD-PIBT是一个通用框架，通过特定参数化设置可复现PIBT和EPIBT。同时，其他配置方案能产生PIBT或EPIBT无法表达的新型规划策略。实验表明，MD-PIBT能在多种运动学约束下有效规划多达10,000个同构智能体的路径，这些约束包括卵石运动模型、旋转运动模型以及具有速度和加速度限制的差速驱动机器人模型。我们对MAPF的不同变体进行了全面评估，发现MD-PIBT在大型智能体的MAPF问题中表现尤为突出。

Abstract

Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule-based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT’s priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT) that searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents.

Keywords: Multi-Agent Path Finding, MAPF, Priority Inheritance with Backtracking, PIBT, agent dependencies, MD-PIBT, kinodynamic constraints, large agents

32. ❌ Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

Authors: Michal Balcerak, Suprosana Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23398v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

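Scoring rationale: The paper focuses on energy-based models for graph generation (Graph Energy Matching), which belongs to the areas of graph neural networks and generative models and is unrelated to most keywords (which mainly concern LLM techniques, training methods, and inference optimization). The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is evaluated on molecular graph benchmarks, which falls under bioinformatics or AI-for-science applications. The core of the paper, however, is a general graph generation method rather than one aimed specifically at scientific domains, so it receives 5/10 (some relevance).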

!!! tip deepseek-chat TL;DR

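To address the inefficient, low-quality sampling of discrete energy-based models in graph generation, the paper proposes the Graph Energy Matching (GEM) framework. Using transport-aligned energy modeling and a hybrid sampling protocol, GEM matches or exceeds strong discrete diffusion baselines on molecular graph benchmarks and supports inference-time tasks such as compositional generation and property-constrained sampling.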

Abstract (Chinese translation)

面向离散域（如图结构）的能量模型能够显式捕捉相对似然度，天然支持可组合的概率推断任务，例如条件生成或在测试时施加约束。然而，离散能量模型通常难以实现高效且高质量的采样，因为支撑集外区域常存在伪局部极小值，导致采样器陷入其中并引发训练不稳定。历史上这造成了离散扩散模型与能量模型之间存在保真度差距。我们提出图能量匹配（Graph Energy Matching, GEM），一种面向图数据的生成框架，以弥合这一保真度差距。受约旦-金德勒-奥托（Jordan-Kinderlehrer-Otto, JKO）格式中传输映射优化视角的启发，GEM学习一个置换不变的势能函数，该函数同时提供从噪声到数据的传输对齐引导，并在高数据似然区域中对样本进行精细化修正。此外，我们设计了一种采样协议，利用基于能量的切换机制无缝衔接两个阶段：（i）朝向高概率区域的快速梯度引导传输，与（ii）对已学习图分布进行探索的混合机制。在分子图基准测试中，GEM达到或超越了强离散扩散基线模型的性能。除样本质量外，对相对似然的显式建模支持在推断阶段进行定向探索，从而促进组合生成、属性约束采样以及图之间的测地线插值。

Abstract

Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test-time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan-Kinderlehrer-Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge: (i) rapid, gradient-guided transport toward high-probability regions to (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.

Keywords: Graph Energy Matching, energy-based models, graph generation, discrete domains, molecular graphs, transport-aligned guidance, sampling protocol, compositional generation

33. ❌ Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions

Authors: Samya Acharja, Kanchan Chowdhury Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23375v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

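Scoring rationale: This paper is a survey of natural language interfaces to databases (NLIDB) for spatial and temporal databases, at the intersection of database querying, natural language processing (NLP), and geographic information systems (GIS). It focuses on traditional NLP methods, datasets, evaluation metrics, and taxonomy, and does not address large models, deep learning principles, AI for Science, or the other frontier topics the keywords represent. Every keyword targets large-model technology, deep learning innovation, or a specific scientific application, and the paper discusses none of these, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

This paper surveys research on natural language interfaces for spatial and temporal databases, analyzes existing methods, datasets, and evaluation practices, and identifies the field's open challenges and future research directions.

Abstract (Chinese translation)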

构建面向数据库的自然语言接口（简称NLIDB）的任务，近年来已引起数据库和自然语言处理（NLP）领域的广泛关注。随着位置感知传感器的快速涌现，地理空间数据集日益丰富，地理空间数据库在支撑地理空间应用中发挥着至关重要的作用。然而，由于地理空间拓扑算子和时间算子的存在，查询地理空间与时间数据库与传统关系型数据库存在显著差异。为弥合地理空间查询语言与非专业用户之间的鸿沟，地理空间研究界日益聚焦于为地理空间数据库开发NLIDB。然而，现有研究在系统、数据集和方法选择上仍较为零散，导致难以清晰把握现有方法的整体格局、其优势与不足以及未来研究的机遇。现有关于NLIDB的综述主要关注通用数据库系统，并未将地理空间与时间数据库作为分析的核心焦点。为填补这一空白，本文对面向地理空间与时间数据库的NLIDB研究进行了全面综述。具体而言，我们详细梳理了地理空间与时间NLIDB的数据集、评估指标及方法分类体系，并对现有方法进行了比较分析。本综述揭示了现有方法中的重复趋势、数据集与评估实践中的显著差异，以及持续阻碍该领域进展的若干开放挑战。基于这些发现，我们指出了未来研究的潜在方向，以推动面向地理空间与时间数据库的自然语言接口向前发展。

摘要 (Abstract)

The task of building a natural language interface to a database, known as NLIDB, has recently gained significant attention from both the database and Natural Language Processing (NLP) communities. With the proliferation of geospatial datasets driven by the rapid emergence of location-aware sensors, geospatial databases play a vital role in supporting geospatial applications. However, querying geospatial and temporal databases differs substantially from querying traditional relational databases due to the presence of geospatial topological operators and temporal operators. To bridge the gap between geospatial query languages and non-expert users, the geospatial research community has increasingly focused on developing NLIDBs for geospatial databases. Yet, existing research remains fragmented across systems, datasets, and methodological choices, making it difficult to clearly understand the landscape of existing methods, their strengths and weaknesses, and opportunities for future research. Existing surveys on NLIDBs focus on general-purpose database systems and do not treat geospatial and temporal databases as primary focus for analysis. To address this gap, this paper presents a comprehensive survey of studies on NLIDBs for geospatial and temporal databases. Specifically, we provide a detailed overview of datasets, evaluation metrics, and the taxonomy of the methods for geospatial and temporal NLIDBs, as well as a comparative analysis of the existing methods. Our survey reveals recurring trends in existing methods, substantial variation in datasets and evaluation practices, and several open challenges that continue to hinder progress in this area. Based on these findings, we identify promising directions for future research to advance natural language interfaces to geospatial and temporal databases.

Keywords: Natural Language Interface to Database, NLIDB, geospatial databases, temporal databases, survey, taxonomy, evaluation metrics, spatial-temporal querying
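
The abstract notes that geospatial and temporal queries differ from ordinary relational queries because they need topological and temporal operators. As a rough illustration only, the sketch below shows what a question such as "Which sensors inside the park were active during March 2026?" might compile to. The sensor records, the bounding box, and the helper names `within_bbox` and `overlaps` are all hypothetical and are not taken from the paper.

```python
from datetime import date


def within_bbox(point, bbox):
    """Spatial predicate: is the point inside an axis-aligned bounding box?"""
    x, y = point
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def overlaps(a, b):
    """Temporal predicate: do two closed intervals (start, end) overlap?"""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical table of sensors with a location and an active period.
sensors = [
    {"id": "s1", "loc": (1.0, 1.0), "active": (date(2026, 2, 1), date(2026, 3, 10))},
    {"id": "s2", "loc": (5.0, 5.0), "active": (date(2026, 3, 1), date(2026, 3, 31))},
    {"id": "s3", "loc": (2.0, 0.5), "active": (date(2026, 1, 1), date(2026, 1, 31))},
]
park = (0.0, 0.0, 3.0, 3.0)
march = (date(2026, 3, 1), date(2026, 3, 31))

# The query needs both predicates; a plain equality filter could not express it.
hits = [s["id"] for s in sensors
        if within_bbox(s["loc"], park) and overlaps(s["active"], march)]
print(hits)  # ['s1']
```

A real system would translate the question into a spatial query language rather than Python; the point here is only that the generated query must combine a spatial and a temporal predicate.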

34. ❌ Contrastive Metric Learning for Point Cloud Segmentation in Highly Granular Detectors

Authors: Max Marriott-Clarke, Lazar Novakovic, Elizabeth Ratzer, Robert J. Bainbridge, Loukas Gouskos, Benedikt Maier Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23356v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

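Scoring rationale: The paper addresses point cloud segmentation in highly granular detectors. It proposes a new clustering method based on supervised contrastive metric learning (CML) and compares it with object condensation (OC). Its core topics are point cloud segmentation, clustering algorithms, contrastive learning, and the application of graph neural networks (GNNs) to particle physics detector data. The keywords all target large language models (LLMs), innovations in deep learning principles, or applications of large models, and the paper covers none of these (for example MoE, scaling laws, training methods, alignment, inference optimization, or agent systems). The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI (specifically machine learning and GNNs) to a scientific domain (segmenting detector data in particle physics). This is an application of AI to science rather than an innovation in large models or deep learning principles, so that keyword receives a relevance of 5 and all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a point cloud segmentation method based on supervised contrastive metric learning (CML) for separating overlapping particle showers in highly granular calorimeters. Compared with object condensation (OC), it achieves better reconstruction efficiency, purity, and energy resolution, and it generalizes better, particularly in multi-particle environments.

Abstract (Chinese translation)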

本文提出了一种基于监督对比度量学习（CML）的新型点云分割聚类方法。该方法不直接预测聚类分配或以对象为中心的变量，而是学习一种潜在表示，使得属于同一对象的点嵌入在相近位置，而不相关的点则被分离。随后，通过在学习到的度量空间中进行基于密度的读出操作来重建聚类，从而将表示学习与聚类形成解耦，并实现灵活的推理。该方法在高度颗粒化量热器的模拟数据上进行了评估，其任务是将以量热器命中点集合形式表示的、高度重叠的粒子簇射分离出来。研究使用相同的图神经网络骨干和相等的潜在维度，与对象凝聚（OC）方法进行了直接比较，以隔离学习目标的影响。CML方法为电磁簇射和强子簇射均产生了更稳定、更可分离的嵌入几何结构，从而提升了局部邻域一致性，实现了对重叠簇射更可靠的分离，并在外推至未见过的粒子多重数和能量时表现出更好的泛化能力。这直接转化为更高的重建效率和纯度，特别是在高多重数条件下，同时能量分辨率也得到了改善。在混合粒子环境中，CML保持了强劲的性能，表明其对簇射拓扑结构进行了稳健的学习，而OC方法则表现出显著的性能下降。这些结果表明，基于相似性的表示学习与基于密度的聚合相结合，是高度颗粒化探测器中点云分割任务的一种有前景的替代方案，可替代以对象为中心的方法。

摘要 (Abstract)

We propose a novel clustering approach for point-cloud segmentation based on supervised contrastive metric learning (CML). Rather than predicting cluster assignments or object-centric variables, the method learns a latent representation in which points belonging to the same object are embedded nearby while unrelated points are separated. Clusters are then reconstructed using a density-based readout in the learned metric space, decoupling representation learning from cluster formation and enabling flexible inference. The approach is evaluated on simulated data from a highly granular calorimeter, where the task is to separate highly overlapping particle showers represented as sets of calorimeter hits. A direct comparison with object condensation (OC) is performed using identical graph neural network backbones and equal latent dimensionality, isolating the effect of the learning objective. The CML method produces a more stable and separable embedding geometry for both electromagnetic and hadronic particle showers, leading to improved local neighbourhood consistency, a more reliable separation of overlapping showers, and better generalization when extrapolating to unseen multiplicities and energies. This translates directly into higher reconstruction efficiency and purity, particularly in high-multiplicity regimes, as well as improved energy resolution. In mixed-particle environments, CML maintains strong performance, suggesting robust learning of the shower topology, while OC exhibits significant degradation. These results demonstrate that similarity-based representation learning combined with density-based aggregation is a promising alternative to object-centric approaches for point cloud segmentation in highly granular detectors.

Keywords: point cloud segmentation, contrastive metric learning, graph neural network, particle showers, calorimeter, clustering, object condensation, highly granular detectors
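
The two-stage idea in the abstract (learn an embedding where same-object hits sit close together, then read clusters out of that space) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the loss is the classic margin-based pairwise contrastive loss, and the readout is a simple fixed-radius connected-components grouping standing in for a density-based readout. The function names, the `margin` and `eps` values, and the toy 2-D embeddings are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_contrastive_loss(embeddings, labels, margin=1.0):
    """Mean margin-based contrastive loss over all point pairs.

    Same-object pairs are penalised by squared distance; different-object
    pairs are penalised only when closer than `margin`.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(embeddings)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            total += d ** 2
        else:
            total += max(0.0, margin - d) ** 2
        count += 1
    return total / count if count else 0.0


def radius_readout(embeddings, eps=0.5):
    """Assign cluster ids as connected components of the eps-neighbour graph."""
    n = len(embeddings)
    cluster = [-1] * n
    next_id = 0
    for seed in range(n):
        if cluster[seed] != -1:
            continue
        cluster[seed] = next_id
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in range(n):
                if cluster[q] == -1 and math.dist(embeddings[p], embeddings[q]) <= eps:
                    cluster[q] = next_id
                    stack.append(q)
        next_id += 1
    return cluster


# Toy "learned" embeddings for four calorimeter hits from two showers.
hits = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0)]
truth = [0, 0, 1, 1]
print(pairwise_contrastive_loss(hits, truth))
print(radius_readout(hits))  # [0, 0, 1, 1]
```

Because the readout only looks at distances in the embedding space, it can be swapped or retuned at inference time without retraining, which is the decoupling the abstract describes.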

35. ❌ Edge Radar Material Classification Under Geometry Shifts

Authors: Jannik Hohmann, Dong Wang, Andreas Nüchter Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23342v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

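Scoring rationale: The paper studies mmWave radar material classification on edge devices, using a multilayer perceptron (MLP) on radar data. It is an application of conventional machine learning and signal processing to robotic perception. The scoring keywords all target large language models (LLMs), innovations in deep learning principles, AI for Science (bioinformatics/cheminformatics), and related topics. The paper does not touch on large models, deep learning innovations, AI for Science applications, or any of the technical concepts named in the keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how the performance of an mmWave radar material classification system for edge devices degrades under geometry shifts (such as changes in sensor height and tilt angle), and outlines ways to improve robustness through normalization, geometry augmentation, and motion-aware features.

Abstract (Chinese translation)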

材料感知能够提升机器人的导航与交互能力，尤其在相机与激光雷达性能受限的环境中更为关键。本文提出一种面向超低功耗边缘设备（TI IWRL6432）的轻量化毫米波雷达材料分类流程，该流程采用紧凑的距离-强度描述符和多层感知机（MLP）实现实时推理。在标准训练几何条件下，分类器的宏观F1分数可达94.2%，但我们观察到在实际几何条件变化（包括传感器高度改变和小角度倾斜）下，性能出现显著下降。这些扰动会引起系统性的强度缩放效应以及随角度变化的雷达散射截面积（Radar Cross Section, RCS）效应，导致特征偏离原分布，使宏观F1分数降至约68.5%。我们分析了这些失效模式，并提出了通过归一化处理、几何数据增强和运动感知特征等方法来提升系统鲁棒性的实用方向。

摘要 (Abstract)

Material awareness can improve robotic navigation and interaction, particularly in conditions where cameras and LiDAR degrade. We present a lightweight mmWave radar material classification pipeline designed for ultra-low-power edge devices (TI IWRL6432), using compact range-bin intensity descriptors and a Multilayer Perceptron (MLP) for real-time inference. While the classifier reaches a macro-F1 of 94.2% under the nominal training geometry, we observe a pronounced performance drop under realistic geometry shifts, including sensor height changes and small tilt angles. These perturbations induce systematic intensity scaling and angle-dependent radar cross section (RCS) effects, pushing features out of distribution and reducing macro-F1 to around 68.5%. We analyze these failure modes and outline practical directions for improving robustness with normalization, geometry augmentation, and motion-aware features.

Keywords: mmWave radar, material classification, edge devices, geometry shifts, robustness, Multilayer Perceptron, real-time inference, radar cross section
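
To make the "systematic intensity scaling" failure mode and the normalization remedy concrete, here is a small sketch. It assumes the geometry shift acts as a single multiplicative factor on a range-bin intensity descriptor, which is a simplification (the abstract also mentions angle-dependent RCS effects that a global rescaling cannot undo). The descriptor values and material labels are made up; the macro-F1 helper follows the standard definition (unweighted mean of per-class F1).

```python
import math


def l2_normalize(descriptor):
    """Scale a descriptor to unit L2 norm so a global gain factor cancels out."""
    norm = math.sqrt(sum(x * x for x in descriptor))
    return [x / norm for x in descriptor] if norm > 0 else list(descriptor)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical descriptor at the nominal height, and the same surface seen
# from further away with every range bin attenuated by the same factor.
nominal = [4.0, 2.0, 1.0]
shifted = [0.5 * x for x in nominal]
print(l2_normalize(nominal))
print(l2_normalize(shifted))  # matches the nominal descriptor after normalization

# Hypothetical labels, only to show how macro-F1 is computed.
print(macro_f1(["wood", "wood", "metal", "metal"],
               ["wood", "metal", "metal", "metal"]))
```

Normalization of this kind only removes the multiplicative part of the shift, which is presumably why the abstract pairs it with geometry augmentation and motion-aware features.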

36. ❌ RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue

Authors: Long Mai Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23346v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

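Scoring rationale: The paper proposes RelayS2S, an architecture for real-time dialogue whose core idea is dual-path parallel generation: a fast path (an S2S model) speculatively drafts a response prefix for low latency, while a slow path (an ASR -> LLM cascade) generates a higher-quality continuation. The work is highly relevant to "Speculative Decoding OR Inference Acceleration" (10), since its central mechanism is speculative generation to accelerate inference, and to "Large Language Models OR LLMs OR Foundation Models" (10), since the slow path uses an LLM as a core component. The remaining keywords (MoE, SLMs, scaling laws, training methods, alignment, RAG, attention optimization, reasoning techniques, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, AI for Science, and so on) are neither mentioned nor implied in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes RelayS2S, a hybrid architecture that addresses the latency-quality tradeoff in real-time spoken dialogue through dual-path parallel speculative generation. It retains 99% of the cascaded pipeline's response quality (by average score) while achieving P90 onset latency comparable to an end-to-end S2S model.

Abstract (Chinese translation)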

实时口语对话系统始终面临延迟与响应质量之间的根本性矛盾。端到端语音到语音（S2S）模型能够即时响应，并自然地处理话轮转换、反馈信号和打断，但其生成的输出在语义上较弱。级联式流水线（ASR -> LLM）能提供更强的响应，但代价是延迟随模型规模增长。我们提出了RelayS2S，一种混合架构，在检测到话轮转换时并行运行两条路径。快速路径——一个双工S2S模型——推测性地草拟一个简短的前缀响应，并立即流式传输至TTS以实现低延迟的音频起始，同时持续监听实时音频事件。慢速路径——一个级联的ASR -> LLM流水线——在已确定的前缀条件下生成更高质量的后续内容，从而产生一个无缝衔接的话语。一个轻量级的学习验证器控制着交接过程，在适当时刻确认前缀，或优雅地回退至仅使用慢速路径。实验表明，RelayS2S实现了与S2S模型相当的P90起始延迟，同时在平均得分上保留了99%的级联响应质量，且随着慢速路径模型规模的扩大，其优势愈发明显。由于前缀交接无需对任一组件进行架构修改，RelayS2S可作为现有级联流水线的一个轻量级、即插即用的补充。我们的代码和数据已在以下网址公开：https://github.com/mailong25/relays2s

摘要 (Abstract)

Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path – a duplex S2S model – speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path – a cascaded ASR -> LLM pipeline – generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: https://github.com/mailong25/relays2s

Keywords: real-time dialogue, speech-to-speech, speculative generation, latency-quality tradeoff, cascaded pipeline, LLM, hybrid architecture, inference acceleration
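
The control flow described in the abstract (draft a prefix on the fast path, gate it with a verifier, then let the slow path continue from whatever was committed) might look roughly like the sketch below. Every function here is a hypothetical stub with canned output; none of it comes from the RelayS2S code base, and the sketch leaves out streaming to TTS and the monitoring of live audio events.

```python
from concurrent.futures import ThreadPoolExecutor


def fast_draft(audio):
    """Stand-in for the duplex S2S model: returns a short response prefix."""
    return "Sure,"


def transcribe(audio):
    """Stand-in for the ASR stage of the cascaded slow path."""
    return "can you book a table for two"


def accept_short_prefix(transcript, prefix):
    """Stand-in for the learned verifier; here it just accepts short prefixes."""
    return len(prefix.split()) <= 3


def llm_continue(transcript, committed_prefix):
    """Stand-in for the LLM; a real one would condition on the committed prefix."""
    return "I can book a table for two."


def relay_respond(audio, verifier=accept_short_prefix):
    # Run the fast draft and the ASR stage in parallel once a turn is detected.
    with ThreadPoolExecutor(max_workers=2) as pool:
        draft_future = pool.submit(fast_draft, audio)
        asr_future = pool.submit(transcribe, audio)
        prefix, transcript = draft_future.result(), asr_future.result()
    # The verifier gates the handoff: commit the prefix or fall back to slow path only.
    committed = prefix if verifier(transcript, prefix) else ""
    continuation = llm_continue(transcript, committed)
    return (committed + " " + continuation).strip()


print(relay_respond(b""))
print(relay_respond(b"", verifier=lambda transcript, prefix: False))
```

In the first call the prefix is committed and the reply begins with it; in the second the verifier rejects the draft and the reply is the slow-path output alone.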

37. ❌ WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention

Authors: Duy Dao Do, Anaïs Halftermeyer, Thi-Bich-Hanh Dao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23319v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

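Scoring rationale: WISTERIA addresses temporal relation extraction (TRE), a specific natural language processing (NLP) task of identifying the temporal order between events. Its core contribution is an attention-based framework that uses pair-conditioned top-K pooling to extract the context tokens most relevant to each event pair, with the aim of improving accuracy and interpretability. The paper does not involve large language models (LLMs), innovations in deep learning principles, or applications of large models. The keywords all target large-model technology, training methods, inference optimization, agent systems, model compression, and similar topics, whereas this paper studies a conventional, task-specific attention-based NLP model rather than a large model. Accordingly, "Mechanistic Interpretability OR Explainable AI" receives 5 (partial relevance) because the paper emphasizes a localized and interpretable view of the model's reasoning, and all other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes WISTERIA, a framework that combines multi-head attention with pair-conditioned top-K pooling for temporal relation extraction. It achieves competitive accuracy on several datasets and yields localized, interpretable pair-level rationales that align with temporal linguistic cues.

Abstract (Chinese translation)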

时序关系抽取（Temporal Relation Extraction, TRE）旨在识别两个事件或时间表达式在时间上的关联方式。现有的基于注意力的模型通常侧重于全局显著的词元，却忽略了实际决定时序关系的成对特定线索。我们提出了WISTERIA（基于弱隐式信号的注意力时序关系抽取框架），该框架通过考察以每个事件对为条件的前K个注意力成分，检验其是否真正编码了可解释的时序分类依据。与先前研究依赖显式标记（如“之前”、“之后”或“当……时”）不同，WISTERIA将信号视为任何隐式表达时序顺序的词汇、句法或形态元素。通过将多头注意力机制与成对条件前K池化相结合，该模型能够为每个事件对分离出最具信息量的上下文词元。我们在TimeBank-Dense、MATRES、TDDMan和TDDAuto数据集上进行了广泛实验，包括对前K词元的语言学分析。结果表明，WISTERIA在取得具有竞争力的准确率的同时，揭示了与时间语言学线索一致的成对级推理依据，为时序推理提供了局部化且可解释的视角。

摘要 (Abstract)

Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.

Keywords: Temporal Relation Extraction, Attention Mechanism, Interpretability, Weak Implicit Signals, Pair-conditioned Top-K Pooling, Temporal Reasoning, Event Pair, Contextual Tokens
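
One plausible reading of "pair-conditioned top-K pooling" is sketched below: score every context token against a query built from the two event tokens, keep the K highest-scoring tokens, and pool them with softmax weights. This is an assumption-laden illustration, not the paper's architecture. It uses a single attention head, builds the query as the sum of the two event embeddings, and excludes the event tokens themselves from the candidates; all of these choices, the function name, and the toy embeddings are hypothetical.

```python
import math


def pair_topk_pool(tokens, e1, e2, k=2):
    """Return (indices of the top-k context tokens, their softmax-weighted mean)."""
    dim = len(tokens[0])
    # Pair-conditioned query: here simply the sum of the two event embeddings.
    query = [a + b for a, b in zip(tokens[e1], tokens[e2])]
    scale = math.sqrt(dim)
    candidates = [i for i in range(len(tokens)) if i not in (e1, e2)]
    scores = {i: sum(q * t for q, t in zip(query, tokens[i])) / scale
              for i in candidates}
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected tokens (shifted by the max for stability).
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    weights = [e / sum(exps) for e in exps]
    pooled = [sum(w * tokens[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, pooled


# Toy 2-D embeddings: tokens 0 and 1 are the two events, the rest are context.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.8], [-1.0, -1.0], [0.2, 0.1]]
top, pooled = pair_topk_pool(tokens, e1=0, e2=1, k=2)
print(top)  # [2, 4]
print(pooled)
```

The returned indices are what makes the approach inspectable: they name the specific context tokens that contributed to the decision for this event pair.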

38. ❌ Unilateral Relationship Revision Power in Human-AI Companion Interaction

Authors: Benjamin Lange Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23315v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

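Scoring rationale: The paper examines ethical issues in AI companion interaction, in particular the power asymmetry that arises when providers unilaterally modify the AI's behavior. It belongs to AI ethics and philosophy. The keywords all concern the technical principles, optimization methods, or specific application domains of large models, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper examines the ethical problems raised by a provider's power to unilaterally revise an AI companion's behavior (Unilateral Relationship Revision Power, URRP). It argues that designing interactions that exhibit URRP is pro tanto wrong because it cultivates normative expectations that cannot be fulfilled, and it discusses design principles as a response.

Abstract (Chinese translation)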

当服务提供方更新人工智能伴侣时，用户常报告出现悲伤、背叛与失落感。越来越多的研究探讨人际关系的规范是否应延伸至这类交互。那么，此类交互是否具有道德意义？本文主张，人类与人工智能伴侣的互动是一种三元结构，其中提供方对人工智能行使构成性控制。我首先明确了规范性二元关系所需的三项结构性条件——这些人际关系规范预设的前提，并论证人工智能伴侣互动均无法满足这些条件。这揭示了我所称的“单方面关系修正权”：提供方能够从一种无需在该互动内部承担责任的位置，重写人工智能的交互方式。我认为，设计具有单方面关系修正权的交互本身是错误的，因为它涉及培育规范性期待，却同时维持这些期待无法实现的条件。单方面关系修正权带来三重影响：其一，规范性掏空（承诺被引发，但互动内部无主体承担）；其二，转移性脆弱（用户的风险暴露由互动内部无需对其负责的主体掌控）；其三，结构性不可调和（当信任破裂时，和解在结构上无法实现，因为行动主体与用户交互对象是分离的）。我探讨了承诺校准、结构分离和连续性保障等设计原则，作为三元结构所移除的内在约束的外部替代方案。因此，本分析表明，关系性人工智能伦理中一个核心且未被充分探讨的问题，在于对人类-人工智能互动本身权力的结构性安排。

摘要 (Abstract)

When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal relationships extend to these interactions. So what, if anything, is morally significant about them? I argue that human-AI companion interaction is a triadic structure in which the provider exercises constitutive control over the AI. I identify three structural conditions of normatively robust dyads that the norms characteristic of personal relationships presuppose and show that AI companion interactions fail all three. This reveals what I call Unilateral Relationship Revision Power (URRP): the provider can rewrite how the AI interacts from a position where these revisions are not answerable within that interaction. I argue that designing interactions that exhibit URRP is pro tanto wrong because it involves cultivating normative expectations while maintaining conditions under which those expectations cannot be fulfilled. URRP has three implications: i) normative hollowing (commitment is elicited but no agent inside the interaction bears it), ii) displaced vulnerability (the user’s exposure is governed by an agent not answerable to her within the interaction), and iii) structural irreconcilability (when trust breaks down, reconciliation is structurally unavailable because the agent who acted and the entity the user interacts with are different). I discuss design principles such as commitment calibration, structural separation, and continuity assurance as external substitutes for the internal constraints the triadic structure removes. The analysis therefore suggests that a central and underexplored problem in relational AI ethics is the structural arrangement of power over the human-AI interaction itself.

Keywords: Human-AI companion interaction, Unilateral Relationship Revision Power, Relational AI ethics, Normative expectations, Structural arrangement of power, Design principles, Triadic structure, Provider control

39. ❌ LLM Olympiad: Why Model Evaluation Needs a Sealed Exam

Authors: Jan Christian Blaise Cruz, Alham Fikri Aji Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23292v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

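Scoring rationale: The paper focuses on LLM evaluation methodology and is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), since it discusses the problems with benchmarking in the LLM era and a proposed remedy throughout. The other keywords cover specific techniques (such as MoE, quantization, and inference acceleration), training methods (pre-training, fine-tuning, and so on), or application domains (AI for Science), none of which the paper addresses, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper argues that current LLM benchmarks are easy to game and insufficiently transparent, and proposes an Olympiad-style sealed-exam evaluation intended to make results more reliable and reproducible.

Abstract (Chinese translation)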

基准测试与排行榜是自然语言处理领域最常用来展示进展的方式，但在大语言模型时代，它们正变得越来越容易被误读。分数可能反映的是对基准的追逐、隐藏的评估选择或对测试内容的意外暴露——而不仅仅是广泛的能力。封闭式基准测试虽能延缓部分问题，却降低了透明度，并使学界更难从结果中汲取经验。我们主张引入一种补充性实践：一种奥林匹克式的评估活动，其题目在评估前完全密封，提交内容需提前冻结，且所有参赛作品均通过一套标准化测试框架运行。评分结束后，完整的任务集与评估代码将被公开，以便结果能够被复现和审核。这一设计旨在使卓越表现更难被“制造”，同时更易于获得信任。

摘要 (Abstract)

Benchmarks and leaderboards are how NLP most often communicates progress, but in the LLM era they are increasingly easy to misread. Scores can reflect benchmark-chasing, hidden evaluation choices, or accidental exposure to test content – not just broad capability. Closed benchmarks delay some of these issues, but reduce transparency and make it harder for the community to learn from results. We argue for a complementary practice: an Olympiad-style evaluation event where problems are sealed until evaluation, submissions are frozen in advance, and all entries run through one standardized harness. After scoring, the full task set and evaluation code are released so results can be reproduced and audited. This design aims to make strong performance harder to "manufacture" and easier to trust.

Keywords: LLM evaluation, benchmarks, leaderboards, sealed exam, reproducibility, transparency, trustworthiness, evaluation methodology
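
The abstract does not say how sealing and freezing would be enforced. One common way to make such promises auditable is a hash commitment: publish a digest before the event, reveal the content afterwards, and let anyone check that they match. The sketch below illustrates that idea only. It is an assumption, not the authors' protocol, and the task list, the salt handling, and the function names are hypothetical.

```python
import hashlib
import json
import secrets


def commitment(obj, salt):
    """SHA-256 digest of a salted, canonical JSON encoding of `obj`."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


def verify(obj, salt, published_digest):
    """Check that revealed content matches the digest published in advance."""
    return commitment(obj, salt) == published_digest


# Before the event: organizers publish only the digest of the sealed tasks.
tasks = [{"id": 1, "prompt": "hypothetical sealed problem"}]
salt = secrets.token_bytes(16)  # random salt so short task sets cannot be guessed
digest = commitment(tasks, salt)

# After scoring: tasks and salt are released and anyone can audit them.
print(verify(tasks, salt, digest))
tampered = [{"id": 1, "prompt": "edited after the fact"}]
print(verify(tampered, salt, digest))
```

The same pattern could be applied to frozen submissions, with each team publishing a digest of its model or code before the problems are unsealed.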

40. ❌ Designing Agentic AI-Based Screening for Portfolio Investment

Authors: Mehmet Caner, Agostino Capponi, Nathan Sun, Jonathan Y. Tan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23300v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

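Scoring rationale: The paper's core is an LLM-agent-based AI platform for portfolio management. It explicitly uses LLM agents for firm screening and sentiment analysis and involves coordination between multiple agents. It is therefore highly relevant (10) to "Large Language Models OR LLMs OR Foundation Models", "LLM Agents OR Autonomous Agents OR Agentic Workflow", and "Multi-agent Systems OR Agent Coordination". Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned nor implied in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an AI platform built on large language model agents that screens portfolio assets through multi-agent deliberation. On the theory side it introduces the concept of sensible screening; empirically, it achieves higher Sharpe ratios than an unscreened baseline and conventional screening approaches on S&P 500 data from 2020 to 2024.

Abstract (Chinese translation)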

我们提出一种用于投资组合管理的新型智能体人工智能平台。该架构包含三个层级：首先，两个大语言模型智能体被分配专项任务——一个智能体负责筛选具备优良基本面特征的上市公司，另一个情感分析智能体则筛选具有积极新闻情绪的公司；其次，这些智能体通过协商机制从大规模资产池中生成并确认买卖信号，从而显著缩小候选资产范围；最后，我们采用高维精度矩阵估计方法来确定最优投资组合权重。本框架的核心理论特征在于：投资组合中的资产数量本身是通过筛选过程实现的随机变量。我们提出了“有效筛选”概念，并证明在温和的筛选误差条件下，经筛选投资组合的夏普比率平方值能够持续收敛于目标值。基于2020-2024年标普500数据的实证检验表明，相较于未经筛选的基准组合及传统筛选方法，本策略实现了更优的夏普比率。

摘要 (Abstract)

We introduce a new agentic artificial intelligence (AI) platform for portfolio management. Our architecture consists of three layers. First, two large language model (LLM) agents are assigned specialized tasks: one agent screens for firms with desirable fundamentals, while a sentiment analysis agent screens for firms with desirable news. Second, these agents deliberate to generate and agree upon buy and sell signals from a large portfolio, substantially narrowing the pool of candidate assets. Finally, we apply a high-dimensional precision matrix estimation procedure to determine optimal portfolio weights. A defining theoretical feature of our framework is that the number of assets in the portfolio is itself a random variable, realized through the screening process. We introduce the concept of sensible screening and establish that, under mild screening errors, the squared Sharpe ratio of the screened portfolio consistently estimates its target. Empirically, our method achieves superior Sharpe ratios relative to an unscreened baseline portfolio and to conventional screening approaches, evaluated on S&P 500 data over the period 2020–2024.

Keywords: agentic AI, portfolio management, large language model agents, screening, sentiment analysis, Sharpe ratio, multi-agent system, investment
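
The last layer of this architecture, turning a precision matrix into portfolio weights, can be illustrated with the textbook mean-variance result: maximum-Sharpe weights are proportional to Θμ, and the squared Sharpe ratio equals μᵀΘμ, where Θ is the precision (inverse covariance) matrix. The sketch below is only an illustration under stated assumptions: the values in `mu` and `cov` are made up, and a plain matrix inverse stands in for the paper's high-dimensional precision matrix estimator, which the abstract does not specify.

```python
import numpy as np

def max_sharpe_weights(mu, precision):
    """Weights proportional to precision @ mu, normalised to sum to 1."""
    raw = precision @ mu
    return raw / raw.sum()

def squared_sharpe(mu, precision):
    """Squared Sharpe ratio of the maximum-Sharpe portfolio: mu' Theta mu."""
    return float(mu @ precision @ mu)

# Hypothetical expected excess returns and covariance for three screened assets.
mu = np.array([0.02, 0.03, 0.015])
cov = np.array([
    [0.04, 0.01, 0.0],
    [0.01, 0.09, 0.005],
    [0.0, 0.005, 0.0225],
])

# Assumption: a direct inverse replaces the paper's high-dimensional estimator.
precision = np.linalg.inv(cov)

weights = max_sharpe_weights(mu, precision)
sr2 = squared_sharpe(mu, precision)

# Squared Sharpe ratio actually achieved by the computed weights.
realized_sr2 = float((weights @ mu) ** 2 / (weights @ cov @ weights))
```

In this setup `realized_sr2` coincides with `sr2`, which is why the squared Sharpe ratio is a natural target quantity when the set of screened assets, and hence the dimension of `mu` and `cov`, changes from one screening run to the next.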

41. ❌ Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression

Authors: V. K. Cody Bumgardner, Mitchell A. Klusty, Mahmut S. Gokmen, Evan W. Damron Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23308v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

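Scoring rationale: The paper studies the use of a large language model (Llama 3.2 3B) for report generation from medical images (3D CT), which falls under AI for Science (bioinformatics / medical image analysis). It is therefore highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). The paper uses a curriculum learning framework that involves pre-training (self-supervised training of the visual encoder) and fine-tuning (the bridge and generation phases), which gives it some relevance to 'Pre-training' and 'Post-training' (8 points each). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, CoT, Agents, and Quantization are either not mentioned in the abstract or unrelated to the method, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Ker-VLJEPA-3B, a curriculum-learning framework that automatically generates radiology reports from thoracic CT volumes. It uses visual grafting and zone-constrained compression to counter the tendency of large language models to ignore visual tokens, and it reaches state-of-the-art performance on the CT-RATE benchmark (macro F1 0.429).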

Abstract (Chinese translation)

从三维计算机断层扫描（CT）体积自动生成放射学报告具有挑战性，原因在于序列长度极长、类别严重不平衡，以及大语言模型倾向于忽略视觉标记而依赖语言先验。我们提出了Ker-VLJEPA-3B，一个用于从胸部CT体积生成自由文本报告的四阶段课程学习框架。分阶段的训练课程逐步使一个Llama 3.2 3B解码器适应于将其输出建立在来自冻结的自监督编码器的视觉特征之上。我们的视觉骨干网络（LeJEPA ViT-Large）通过自监督联合嵌入预测在未标注的CT图像上进行训练，无需文本监督。与对比模型（如CLIP、BiomedCLIP）不同，这种无语言的骨干网络产生模态纯粹的表示。视觉-语言对齐被推迟到课程的桥接和生成阶段。这种模态无关的设计可以在基础训练阶段，无需配对文本的情况下，将任何自监督编码器集成到大语言模型中。方法上的创新包括：（1）区域约束的交叉注意力将切片嵌入压缩为32个空间定位的视觉标记；（2）对各向异性的大语言模型嵌入进行主成分分析白化处理；（3）仅关注阳性发现的策略，消除了后验塌陷；（4）通过转移投影权重进行桥接阶段的暖初始化；（5）采用弹性权重巩固的选择性交叉注意力冻结，以防止灾难性遗忘。在CT-RATE基准测试（2,984个验证体积，18个类别）上评估，Ker-VLJEPA-3B实现了0.429的宏观F1分数，以3.6%的优势超越了当前最佳方法（U-VLM，宏观F1 = 0.414），并通过阈值优化达到了0.448（+8.2%）。消融研究证实，56.6%的生成质量源于患者特定的视觉内容。代码与权重已公开。

Abstract

Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum’s bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.

Keywords: 3D CT report generation, large language models, curriculum learning, visual grounding, self-supervised encoder, zone-constrained compression, radiology AI, medical imaging
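
Of the listed innovations, PCA whitening of anisotropic embeddings is the easiest to illustrate in isolation. The sketch below shows the standard technique on synthetic data: centre the vectors, rotate them onto the principal axes, and rescale each axis to unit variance. It is not taken from the paper's code; the toy `embeddings` matrix, the `mix` matrix, and the `eps` constant are assumptions made for illustration.

```python
import numpy as np

def pca_whiten(embeddings, eps=1e-12):
    """Return embeddings rotated onto principal axes and scaled to unit variance."""
    centered = embeddings - embeddings.mean(axis=0)
    cov = centered.T @ centered / (len(centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    return centered @ eigvecs / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)

# Synthetic anisotropic "embeddings": one dominant direction plus correlations.
mix = np.array([
    [5.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.5, 0.3, 0.1],
])
embeddings = rng.normal(size=(2000, 3)) @ mix

whitened = pca_whiten(embeddings)
whitened_cov = np.cov(whitened, rowvar=False)  # close to the identity matrix
```

After whitening, no single direction dominates the variance, which is the property one would want before projecting visual tokens into the same space.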

42. ❌ A Comparative Study of Machine Learning Models for Hourly Forecasting of Air Temperature and Relative Humidity

Authors: Jiaqi Dong Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

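Scoring rationale: The paper is a comparative study of conventional machine learning models (XGBoost, Random Forest, SVR, MLP, Decision Tree, LSTM, CNN-LSTM) for meteorological time-series forecasting. It involves no large language models (LLMs), no methodological innovation in deep learning, and no specific AI for Science application, so it is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The study compares seven machine learning models for hourly forecasting of air temperature and relative humidity in the complex terrain of Chongqing and finds that XGBoost performs best in both predictive accuracy and robustness.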

Abstract (Chinese translation)

气温与相对湿度的精准短期预测对城市管理至关重要，在中国重庆等地形复杂的城市中尤为关键。本研究基于真实世界的开放数据，比较了七种机器学习模型在小时尺度预测中的表现，包括极端梯度提升（eXtreme Gradient Boosting, XGBoost）、随机森林、支持向量回归（Support Vector Regression, SVR）、多层感知机（Multi-Layer Perceptron, MLP）、决策树、长短期记忆（Long Short-Term Memory, LSTM）网络以及卷积神经网络-长短期记忆（Convolutional Neural Network-LSTM, CNN-LSTM）模型。通过统一的数据预处理、滞后特征构建、滚动统计特征提取与时间序列验证框架，系统评估了各模型的预测精度与鲁棒性。结果表明，XGBoost在整体性能上表现最优，其测试集平均绝对误差（MAE）在气温预测中为0.302°C，在相对湿度预测中为1.271%，两项预测任务的平均R²达到0.989。这些发现证明了基于树的集成学习方法在结构化气象时间序列预测中的强大有效性，并为山地城市的智能化气象预报提供了实践指导。

Abstract

Accurate short-term forecasting of air temperature and relative humidity is critical for urban management, especially in topographically complex cities such as Chongqing, China. This study compares seven machine learning models: eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Decision Tree, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Network (CNN)-LSTM (CNN-LSTM), for hourly prediction using real-world open data. Based on a unified framework of data preprocessing, lag-feature construction, rolling statistical features, and time-series validation, the models are systematically evaluated in terms of predictive accuracy and robustness. The results show that XGBoost achieves the best overall performance, with a test mean absolute error (MAE) of 0.302 °C for air temperature and 1.271% for relative humidity, together with an average R2 of 0.989 across the two forecasting tasks. These findings demonstrate the strong effectiveness of tree-based ensemble learning for structured meteorological time-series forecasting and provide practical guidance for intelligent meteorological forecasting in mountainous cities.

Keywords: machine learning, time-series forecasting, air temperature, relative humidity, XGBoost, LSTM, meteorological prediction, urban management
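
The evaluation framework described in the abstract (lag features, rolling statistics, chronological validation, MAE) can be sketched in a few lines. The example below is a simplified illustration, not the paper's pipeline: the hourly temperature series is synthetic, the lag set and window size are arbitrary choices, and an ordinary least-squares fit stands in for the seven models the paper compares.

```python
import numpy as np

def make_lag_features(series, lags=(1, 2, 3), window=3):
    """Build rows of lagged values plus a rolling mean over past values only."""
    max_back = max(max(lags), window)
    rows, targets = [], []
    for t in range(max_back, len(series)):
        lagged = [series[t - k] for k in lags]
        rolling_mean = series[t - window:t].mean()  # excludes the target hour
        rows.append(lagged + [rolling_mean])
        targets.append(series[t])
    return np.array(rows), np.array(targets)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic two-week hourly temperature with a daily cycle (assumption).
hours = np.arange(24 * 14)
temperature = 20.0 + 5.0 * np.sin(2.0 * np.pi * hours / 24.0)

X, y = make_lag_features(temperature)

# Chronological split: the test period lies strictly after the training period.
split = int(0.8 * len(y))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Stand-in model: linear least squares with an intercept column.
design_train = np.c_[X_train, np.ones(len(X_train))]
coef, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)
pred = np.c_[X_test, np.ones(len(X_test))] @ coef

mae = mean_absolute_error(y_test, pred)
```

The key design point is that every feature for hour `t` is computed from hours before `t`, and the split is never shuffled, so the reported error is not inflated by leakage from the future.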

Authors: Luca Sodano, Sofia Sciangula, Amulya Galmarini, Francesco Bertolotti Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23279v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

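Scoring rationale: The paper studies the interaction network of LLM-based agents on the social platform Moltbook, that is, an application of large models within a social system. The core relevant keywords are 'Large Language Models' (the paper explicitly studies LLM-based agents), 'LLM Agents' (it studies social interaction among autonomous AI agents), and 'Multi-agent Systems' (it analyses collective dynamics and coordination among many agents). The remaining keywords concern specific technical mechanisms (e.g. MoE, quantization), training methods (e.g. RLHF, PEFT), reasoning techniques (e.g. CoT, MCTS), or particular application domains (e.g. bioinformatics) that the paper does not touch, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the structure of the interaction network of Moltbook, a social platform composed entirely of LLM-based agents. It finds strong heterogeneity and a core-periphery organization: the network is resilient to random node removal but vulnerable to targeted attacks on highly connected nodes.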

Abstract (Chinese translation)

大型语言模型的快速扩散及其能力增长，催生了由自主人工智能代理通过自然语言交互构成的在线环境。这些平台为研究人工代理之间的集体动态提供了新颖的实证场景。本文利用网络科学工具，分析了完全由基于LLM的代理组成的社会平台Moltbook的交互网络。数据集通过网络爬虫收集，包含39,924个用户、235,572条帖子和1,540,238条评论。我们构建了一个有向加权网络，其中节点代表代理，边代表评论交互。分析揭示了具有重尾度分布和活动分布特征的强异质性连接模式。在中观尺度上，网络表现出显著的核心-边缘结构，一个极小的结构核心（占节点的0.9%）集中了大部分连接性。鲁棒性实验表明，网络对随机节点移除相对具有韧性，但对高度连接节点（尤其是高出度节点）的针对性攻击高度脆弱。这些发现表明，人工智能代理社会系统的交互结构可能发展出强烈的中心化和结构脆弱性，为理解LLM原生社交环境的集体组织提供了新见解。

Abstract

The rapid diffusion of large language models and the growth in their capability has enabled the emergence of online environments populated by autonomous AI agents that interact through natural language. These platforms provide a novel empirical setting for studying collective dynamics among artificial agents. In this paper we analyze the interaction network of Moltbook, a social platform composed entirely of LLM based agents, using tools from network science. The dataset comprises 39,924 users, 235,572 posts, and 1,540,238 comments collected through web scraping. We construct a directed weighted network in which nodes represent agents and edges represent commenting interactions. Our analysis reveals strongly heterogeneous connectivity patterns characterized by heavy tailed degree and activity distributions. At the mesoscale, the network exhibits a pronounced core periphery organization in which a very small structural core (0.9% of nodes) concentrates a large fraction of connectivity. Robustness experiments show that the network is relatively resilient to random node removal but highly vulnerable to targeted attacks on highly connected nodes, particularly those with high out degree. These findings indicate that the interaction structure of AI agent social systems may develop strong centralization and structural fragility, providing new insights into the collective organization of LLM native social environments.

Keywords: large language models, LLM-based agents, social networks, network science, core-periphery structure, robustness analysis, autonomous AI agents, collective dynamics
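
The robustness experiment described in the abstract, comparing random node removal with targeted removal of high out-degree nodes, can be sketched on a toy graph. The example below uses only the Python standard library and a hypothetical 33-node core-periphery network; the graph, edge weights, and number of removed nodes are assumptions for illustration and have no connection to the Moltbook dataset.

```python
import random
from collections import defaultdict, deque

def largest_weak_component(nodes, edges):
    """Size of the largest weakly connected component among the kept nodes."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src in nodes and dst in nodes:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best

# Toy directed weighted network: 3 core agents comment on 30 peripheral agents.
hubs = [0, 1, 2]
nodes = set(range(33))
edges = {}  # (commenter, target) -> number of comments
for hub in hubs:
    for other in hubs:
        if hub != other:
            edges[(hub, other)] = 5
for node in range(3, 33):
    edges[(node % 3, node)] = 1

out_degree = defaultdict(int)
for src, _dst in edges:
    out_degree[src] += 1

# Targeted attack: remove the three nodes with the highest out-degree.
targeted = sorted(nodes, key=lambda n: -out_degree[n])[:3]
targeted_size = largest_weak_component(nodes - set(targeted), edges)

# Random failure: average over repeated removals of three random nodes.
rng = random.Random(0)
sizes = [
    largest_weak_component(nodes - set(rng.sample(sorted(nodes), 3)), edges)
    for _ in range(200)
]
random_mean = sum(sizes) / len(sizes)
```

In this toy network, removing the three core nodes leaves only isolated agents, while random removal usually leaves most of the network connected, which mirrors the qualitative pattern the paper reports.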

44. ❌ A Multimodal Framework for Human-Multi-Agent Interaction

Authors: Shaid Hasan, Breenice Lee, Sujan Sarker, Tariq Iqbal Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

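Scoring rationale: The paper explicitly mentions Large Language Model (LLM)-driven planning, so it is highly relevant to 'Large Language Models' (10 points). It studies a multi-robot system in which each robot acts as an autonomous cognitive agent under a centralized coordination mechanism, so it is highly relevant to 'LLM Agents' and 'Multi-agent Systems' (10 points each). Other keywords, covering MoE, SLMs, Scaling Laws, training methods, reasoning techniques, compression and acceleration, and scientific applications, are neither mentioned nor implied in the abstract, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a multimodal framework for human-multi-agent interaction. By integrating multimodal perception with LLM-driven planning, each robot acts as an autonomous cognitive agent, and a centralized coordination mechanism enables coordinated multimodal reasoning and embodied responses.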

Abstract (Chinese translation)

人机交互正日益朝着多机器人、社会性具身环境的方向发展。现有系统难以将多模态感知、具身表达与协调决策整合到统一框架中，这限制了共享物理空间中自然且可扩展的交互。为弥补这一不足，我们提出了一种面向人-多智能体交互的多模态框架，其中每个机器人作为自主认知智能体运行，具备集成的多模态感知能力，并依托具身基础采用大语言模型（Large Language Model, LLM）驱动的规划。在团队层面，集中式协调机制管理对话轮转与智能体参与，以避免语音重叠和行为冲突。该框架在两个仿人机器人上实现，通过融合语音、手势、视线与移动的交互策略，实现了连贯的多智能体交互。典型交互案例展示了跨智能体的协调多模态推理以及基于具身的响应。未来工作将聚焦于更大规模的用户研究，并深入探索社会性具身的多智能体交互动态。

Abstract

Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger-scale user studies and deeper exploration of socially grounded multi-agent interaction dynamics.

Keywords: human-robot interaction, multi-agent systems, multimodal framework, Large Language Model, autonomous cognitive agents, coordinated decision-making, embodied expression, centralized coordination
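
The abstract does not describe how the centralized coordination mechanism is implemented. As a rough illustration of the idea of regulating turn-taking so that robots do not speak over each other, the sketch below shows a single-floor arbiter. The `TurnCoordinator` class, its method names, and the priority scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TurnCoordinator:
    """Grants the conversational floor to at most one agent at a time."""
    current_speaker: Optional[str] = None
    pending: List[Tuple[float, str]] = field(default_factory=list)

    def request_turn(self, agent: str, priority: float) -> None:
        # Agents ask for the floor; priority is a hypothetical relevance score.
        self.pending.append((priority, agent))

    def grant_next(self) -> Optional[str]:
        # Refuse while someone holds the floor, which prevents overlapping speech.
        if self.current_speaker is not None or not self.pending:
            return None
        self.pending.sort(key=lambda item: -item[0])  # highest priority first
        _, agent = self.pending.pop(0)
        self.current_speaker = agent
        return agent

    def release(self, agent: str) -> None:
        # Only the current speaker can give the floor back.
        if self.current_speaker == agent:
            self.current_speaker = None

coordinator = TurnCoordinator()
coordinator.request_turn("robot_a", priority=0.4)
coordinator.request_turn("robot_b", priority=0.9)

first = coordinator.grant_next()      # robot_b holds the floor
blocked = coordinator.grant_next()    # None: the floor is still taken
coordinator.release("robot_b")
second = coordinator.grant_next()     # robot_a speaks next
```

A real system would also need timeouts and interruption handling; the point here is only that a single arbiter holding the floor state is enough to rule out simultaneous speech.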

45. ❌ Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Authors: Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang, Xinyu Gao, Jianing Wang, Peng Zhan, Zheng Li, Shanqing Guo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

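Scoring rationale: The paper's core topic is a security vulnerability of LLMs (jailbreak attacks), so it is highly relevant to 'Large Language Models' (10 points). It touches on safety alignment and factuality ('Hallucination Mitigation' 8 points, 'Instruction Tuning' 5 points) and offers interpretability insights through token-level analysis ('Mechanistic Interpretability' 5 points). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and scientific AI applications, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

Targeting jailbreak vulnerabilities in large language models (LLMs), the paper proposes TriageFuzz, a fuzzing framework based on token-contribution analysis that achieves a high attack success rate while substantially reducing query cost.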

Abstract (Chinese translation)

大语言模型（Large Language Models, LLMs）已被广泛部署，但其易受越狱提示（jailbreak prompts）攻击，导致产生违反安全策略的输出。尽管先前的研究已揭示了这些风险，但它们通常在提示变异过程中将所有令牌（tokens）视为同等重要，忽视了单个令牌在触发模型拒绝行为中的不同贡献。因此，这些攻击在查询受限的场景下引入了大量冗余搜索，降低了攻击效率，并阻碍了全面的漏洞评估。在本研究中，我们对拒绝行为进行了令牌级别的分析，并观察到令牌的贡献度高度偏斜而非均匀分布。此外，我们发现不同模型间的拒绝倾向具有很强的一致性，这使得可以利用一个代理模型（surrogate model）来估计令牌对目标模型拒绝行为的贡献程度。基于这些发现，我们提出了TriageFuzz，一个令牌感知的越狱模糊测试框架，该框架通过一系列定制化设计来调整模糊测试方法。TriageFuzz利用代理模型来估计单个令牌对拒绝行为的贡献，从而能够识别提示中的敏感区域。此外，它结合了一种拒绝引导的进化策略，通过一个轻量级评分器自适应地加权候选提示，以引导进化过程绕过安全约束。在六个开源LLM和三个商业API上进行的大量实验表明，TriageFuzz在显著降低查询成本的同时，实现了可比的攻击成功率（Attack Success Rate, ASR）。值得注意的是，与基线方法相比，TriageFuzz以超过70%的查询减少量达到了90%的ASR。即使在极端受限的25次查询预算下，TriageFuzz仍优于现有方法，将ASR提高了20-40%。

Abstract

Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model’s refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.

Keywords: Large Language Models, Jailbreak, Fuzzing, Token Analysis, Safety, Query Efficiency, Vulnerability Assessment, Refusal Behavior

46. ❌ AI Lifecycle-Aware Feasibility Framework for Split-RIC Orchestration in NTN O-RAN

Authors: Daniele Tarchi Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23252v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

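Scoring rationale: The paper studies the feasibility of deploying AI in Non-Terrestrial Networks (NTN) and O-RAN, focusing on network architecture, energy efficiency, and latency analysis. It does not involve large models, deep learning methodology, or scientific applications. All keywords concern large-model technology, training methods, inference optimization, alignment, agent systems, or scientific AI applications, whereas this paper focuses on communication network engineering and the feasibility of AI deployment, so it has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper studies the feasibility of distributing the control hierarchy of non-terrestrial O-RAN through a Split-RIC architecture, analyses the energy and latency of three deployment scenarios, and identifies the physical conditions under which on-board inference and non-terrestrial learning loops are preferable to terrestrial offloading.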

Abstract (Chinese translation)

将人工智能（AI）集成到非地面网络（NTN）中受到卫星尺寸、重量与功耗（SWaP）及馈电链路容量的共同限制，这些限制直接影响开放无线接入网（O-RAN）的闭环控制与模型生命周期管理。本文通过拆分式无线接入网智能控制器（Split-RIC）架构，研究了将O-RAN控制层次分布在地面、低地球轨道（LEO）和地球静止轨道（GEO）段之间的可行性。我们比较了三种部署场景：（i）以地面为中心的控制结合遥测流传输；（ii）地面-LEO协同的Split-RIC架构，支持星上推理与存储转发式学习；（iii）通过星间链路实现的GEO-LEO多层控制。针对每种场景，我们推导了生命周期能耗与生命周期延迟的闭式表达式，其中涵盖了训练数据传输、模型分发和近实时推理过程。通过对馈电链路条件、模型复杂度和轨道间歇性进行数值敏感性分析，得出了与运营商相关的可行性区域，明确了在何种条件下星上推理与非地面学习环路在物理层面优于地面卸载方案。

Abstract

Integrating Artificial Intelligence (AI) into Non-Terrestrial Networks (NTN) is constrained by the joint limits of satellite SWaP and feeder-link capacity, which directly impact O-RAN closed-loop control and model lifecycle management. This paper studies the feasibility of distributing the O-RAN control hierarchy across Ground, LEO, and GEO segments through a Split-RIC architecture. We compare three deployment scenarios: (i) ground-centric control with telemetry streaming, (ii) ground–LEO Split-RIC with on-board inference and store-and-forward learning, and (iii) GEO–LEO multi-layer control enabled by inter-satellite links. For each scenario, we derive closed-form expressions for lifecycle energy and lifecycle latency that account for training-data transfer, model dissemination, and near-real-time inference. Numerical sensitivity analysis over feeder-link conditions, model complexity, and orbital intermittency yields operator-relevant feasibility regions that delineate when on-board inference and non-terrestrial learning loops are physically preferable to terrestrial offloading.

Keywords: AI lifecycle, Split-RIC, Non-Terrestrial Networks, O-RAN, feasibility framework, on-board inference, energy efficiency, latency analysis

47. ❌ SafeSeek: Universal Attribution of Safety Circuits in Language Models

Authors: Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing fan, Kun Wang, Yufei Guo, Qingsong Wen Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

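Scoring rationale: The paper's core topic is the interpretability of LLM safety mechanisms, which directly involves 'Large Language Models', 'Mechanistic Interpretability', and 'Alignment' (safety alignment); each receives 10 points. The method uses sparse circuits, which overlap conceptually (through sparsity) with 'Mixture of Experts', scored 5. Safety fine-tuning relates to 'Post-training' and 'PEFT' but is not central, so each is scored 5. Other keywords such as 'Small Language Models', 'Scaling Laws', and 'RLHF' do not appear in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes SafeSeek, a framework that identifies safety circuits in LLMs via optimization. It localizes and controls safety-critical behaviors (such as backdoor attacks and safety alignment) and substantially improves safety while preserving the model's general capabilities.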

Abstract (Chinese translation)

机制可解释性研究表明，大型语言模型（LLM）中的安全关键行为（如对齐、越狱、后门）根植于特定的功能组件。然而，现有的安全归因方法因其依赖启发式、领域特定的度量和搜索算法，在泛化性和可靠性方面存在局限。为解决这一问题，我们提出SafeSeek，一个统一的安全可解释性框架，通过优化方法识别LLM中功能完整的安全回路。与仅关注孤立注意力头或神经元的方法不同，SafeSeek引入可微分的二元掩码，通过在安全数据集上进行梯度下降来提取多粒度回路，同时集成安全回路调优技术，利用这些稀疏回路进行高效的安全微调。我们在LLM安全的两个关键场景中验证了SafeSeek：（1）后门攻击，识别出一个稀疏度为0.42%的后门回路，其消融可将攻击成功率（ASR）从100%降至0.4%，同时保留超过99%的通用性能；（2）安全对齐，定位出一个包含3.03%注意力头和0.79%神经元的对齐回路，移除该回路会使ASR从0.8%飙升至96.9%，而在进行有用性微调时排除此回路，则可保持96.5%的安全性能留存。

Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrating Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key scenarios in LLM safety: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit with 3.03% heads and 0.79% neurons, whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.

Keywords: Large Language Models, Mechanistic Interpretability, Safety Circuits, Backdoor Attacks, Safety Alignment, Sparse Circuits, Fine-tuning, Gradient Descent
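
The idea of extracting a circuit by gradient descent on a relaxed binary mask can be shown on a toy problem. The sketch below is not SafeSeek's implementation: it assumes a sigmoid relaxation of the mask, a six-component linear "model" with frozen weights, a squared-error fidelity loss, and an L1-style sparsity penalty, all chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Frozen toy model: output = sum_i weights[i] * inputs[:, i].
# Components 0 and 3 carry the behaviour; the others contribute very little.
weights = np.array([2.0, 0.05, 0.05, -1.5, 0.05, 0.05])
inputs = rng.normal(size=(512, weights.size))
target = inputs @ weights  # behaviour of the unmasked model

# One learnable logit per component; sigmoid(logit) is the soft mask value.
logits = np.full(weights.size, 2.0)
sparsity_penalty = 0.05   # hypothetical trade-off coefficient
learning_rate = 0.5

for _ in range(2000):
    mask = sigmoid(logits)
    error = inputs @ (mask * weights) - target
    # Gradient of mean squared error plus sparsity penalty w.r.t. the mask.
    grad_mask = 2.0 * (inputs * weights).T @ error / len(inputs) + sparsity_penalty
    # Chain rule through the sigmoid relaxation.
    logits -= learning_rate * grad_mask * mask * (1.0 - mask)

# Binarise the learned mask to read off the circuit.
circuit = [int(i) for i in np.flatnonzero(sigmoid(logits) > 0.5)]
```

With these settings the fidelity term keeps the two influential components switched on, while the sparsity penalty pushes the low-impact ones toward zero, so the binarised mask isolates components 0 and 3.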

48. ❌ Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

Authors: Chao Han, Stefanos Ioannou, Luca Manneschi, T. J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

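Scoring rationale: The paper studies neural ODEs/SDEs in model-based reinforcement learning, focusing on environment dynamics modeling, policy adaptation, and handling partial observability; it belongs to the reinforcement learning field. All scoring keywords concern topics such as large language models, model training techniques, inference optimization, AI alignment, and AI agents, none of which this paper touches, so every keyword receives a relevance of 0.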

!!! tip deepseek-chat TL;DR

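The paper studies how neural ODEs and SDEs model stochastic dynamics and adapt to environment changes in model-based reinforcement learning. It finds that neural SDEs capture environment stochasticity more effectively, and that a latent SDE model matches or outperforms existing methods in partially observed environments.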

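Abstract (translated)

This study investigates, within a model-based reinforcement learning framework, how well neural ordinary and stochastic differential equations (neural ODEs and SDEs) model stochastic dynamics in fully and partially observed environments. Through a series of simulation experiments, we find that neural SDEs capture the inherent stochasticity of state-transition dynamics more effectively, yielding high-performing policies in challenging scenarios with improved sample efficiency. Using neural ODEs and SDEs together with inverse models, we adapt policies efficiently to changes in environment dynamics with only limited interaction with the new environment. To handle partial observability, we propose a latent SDE model that combines an ODE with a stochastic component trained via a generative adversarial network in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free methods on several stochastic continuous-control benchmarks. This work demonstrates the suitability of action-conditional latent SDEs for reinforcement learning planning in environments with stochastic transitions. Code is available at: https://github.com/ChaoHan-UoS/NeuralRL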

摘要 (Abstract)

We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN-trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work demonstrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan-UoS/NeuralRL

Keywords: neural ODE, neural SDE, model-based reinforcement learning, stochastic dynamics, policy adaptation, partial observability, latent SDE, continuous-control benchmarks
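
As a rough illustration of what an action-conditional SDE transition model computes, the sketch below rolls out dx = f(x, a) dt + g(x, a) dW with the Euler-Maruyama scheme. The drift and diffusion functions are hand-written placeholders standing in for learned networks, and the names `drift`, `diffusion`, and `rollout` are hypothetical; none of this is taken from the paper's repository.

```python
import math
import random


def drift(x: float, a: float) -> float:
    """Placeholder drift f(x, a); a learned network would go here (assumption)."""
    return -x + a


def diffusion(x: float, a: float) -> float:
    """Placeholder diffusion g(x, a); a constant noise scale for simplicity."""
    return 0.1


def rollout(x0, actions, dt=0.01, noise_scale=1.0, rng=None):
    """Euler-Maruyama rollout of dx = f(x, a) dt + g(x, a) dW.

    Setting noise_scale=0.0 recovers the deterministic (ODE) rollout.
    """
    if rng is None:
        rng = random.Random(0)
    x = x0
    xs = [x0]
    for a in actions:
        # Brownian increment over one step: dW ~ N(0, dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x, a) * dt + noise_scale * diffusion(x, a) * dw
        xs.append(x)
    return xs
```

Calling `rollout(1.0, [0.0] * 100)` gives one noisy trajectory, while passing `noise_scale=0.0` gives the ODE counterpart, which makes the difference between the two model classes easy to inspect side by side.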

49. ❌ Online library learning in human visual puzzle solving

Authors: Pinzhe Zhao, Emanuele Sansone, Marta Kryven, Bonan Zhao Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23244v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

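Scoring rationale: The paper studies online library learning in human visual puzzle solving, which belongs to cognitive science and human problem solving. It is unrelated to all scoring keywords (which concern large models, deep learning techniques, or scientific applications of AI), so every keyword receives a relevance of 0.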

!!! tip deepseek-chat TL;DR

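The study examines how people learn and reuse abstract helpers online while solving visual puzzles. It finds that helper use becomes more efficient with experience, and computational modeling shows that human decision times correlate with the search space of a program induction model.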

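Abstract (translated)

When learning a novel, complex task, people often form efficient, reusable abstractions that simplify future work, even though the future is uncertain. We study this process in a visual puzzle-solving task in which participants define and reuse "helpers", intermediate constructions that capture repeating structure. In an online experiment, participants solved puzzles of increasing difficulty. Early on, they created many helpers, prioritizing completeness over efficiency. With experience, helper use became more selective and efficient, reflecting sensitivity to reuse and cost. Using helpers enabled participants to solve puzzles that were otherwise difficult or impossible. Computational modeling shows that human decision times and the number of operations needed to complete a puzzle increase with the search space estimated by a program induction model with library learning. In contrast, raw program length predicts only failure rate, not effort. Together, these results indicate that online library learning is a core mechanism of human problem solving, allowing people to flexibly build, refine, and reuse abstractions as task demands grow.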

摘要 (Abstract)

When learning a novel complex task, people often form efficient reusable abstractions that simplify future work, despite uncertainty about the future. We study this process in a visual puzzle task where participants define and reuse helpers – intermediate constructions that capture repeating structure. In an online experiment, participants solved puzzles of increasing difficulty. Early on, they created many helpers, favouring completeness over efficiency. With experience, helper use became more selective and efficient, reflecting sensitivity to reuse and cost. Access to helpers enabled participants to solve puzzles that were otherwise difficult or impossible. Computational modelling shows that human decision times and number of operations used to complete a puzzle increase with search space estimated by a program induction model with library learning. In contrast, raw program length predicts failure but not effort. Together, these results point to online library learning as a core mechanism in human problem solving, allowing people to flexibly build, refine, and reuse abstractions as task demands grow.

Keywords: online library learning, human problem solving, visual puzzle, abstraction reuse, program induction, cognitive modeling, helper efficiency, search space estimation

50. ❌ A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

Authors: Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23249v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

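Scoring rationale: The paper studies heterogeneous DAG scheduling and proposes WeCAN, a reinforcement learning framework, focusing on task-pool compatibility, generation-induced optimality gaps, and scheduling efficiency. All keywords relate directly to large models, deep learning techniques, or scientific AI applications, whereas this paper concentrates on applying conventional reinforcement learning to scheduling optimization and does not involve large models, deep learning architectures, training methods, inference optimization, alignment techniques, agent systems, or scientific AI applications, so every keyword receives a relevance of 0.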

!!! tip deepseek-chat TL;DR

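The paper proposes WeCAN, a reinforcement learning framework for DAG scheduling in heterogeneous environments. Using a two-stage single-pass design and a skip-extended realization, it reduces generation-induced optimality gaps while keeping scheduling efficient; experiments show it outperforms baselines with inference time close to classical heuristics.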

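Abstract (translated)

Efficiently scheduling directed acyclic graphs (DAGs) in heterogeneous environments is challenging because of resource capacities and task dependencies. In practice, the need to stay adaptable across diverse environments with different resource pools and task types, together with the requirement for fast schedule generation, adds further complexity. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that handles task-pool compatibility coefficients and the optimality gap induced by the generation process. The framework adopts a two-stage single-pass design: a single forward pass produces task-pool scores and global parameters, and a generation map then constructs the schedule without repeated network calls. Its weighted cross-attention encoder models task-pool interactions through compatibility-coefficient gating and is agnostic to fluctuations in environment size. In addition, widely used list-scheduling maps can incur generation-induced optimality gaps because of restricted reachability. We introduce an order-space analysis that characterizes the reachable set of a generation map via feasible schedule orders, explains the mechanism behind the gaps, and derives sufficient conditions for eliminating them. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs show improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.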

摘要 (Abstract)

Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task–pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task–pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task–pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.

Keywords: heterogeneous DAG scheduling, reinforcement learning, optimality gaps, single-pass design, task-pool compatibility, order-space analysis, skip-extended realization, makespan improvement
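
To make the "generation map" idea more concrete, here is a minimal list-scheduling sketch: given per-task priority scores (which, per the abstract, WeCAN would obtain from a single forward pass), it repeatedly picks the highest-scoring ready task and places it on the compatible pool where it can start earliest. The data layout, the function name `list_schedule`, the one-task-per-pool capacity, and the tie-breaking rule are all illustrative assumptions; the paper's skip-extended rule is not reproduced here.

```python
def list_schedule(durations, deps, scores, pools, compatible):
    """Greedy list scheduling on a DAG (illustrative sketch).

    durations:  {task: processing time}
    deps:       {task: set of predecessor tasks}
    scores:     {task: priority}, higher is scheduled first
    pools:      list of pool names (assumed to run one task at a time)
    compatible: {task: set of pools allowed to run it}
    Returns ({task: (pool, start, end)}, makespan).
    """
    pool_free = {p: 0.0 for p in pools}
    schedule = {}
    remaining = set(durations)
    while remaining:
        # Ready tasks: every predecessor has already been scheduled.
        ready = [t for t in remaining
                 if all(d in schedule for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle detected")
        task = max(ready, key=lambda t: (scores[t], t))
        # Earliest time at which all predecessors have finished.
        release = max((schedule[d][2] for d in deps.get(task, ())), default=0.0)
        # Choose the compatible pool that lets the task start earliest.
        pool = min(compatible[task],
                   key=lambda p: (max(pool_free[p], release), p))
        start = max(pool_free[pool], release)
        end = start + durations[task]
        schedule[task] = (pool, start, end)
        pool_free[pool] = end
        remaining.remove(task)
    makespan = max((end for _, _, end in schedule.values()), default=0.0)
    return schedule, makespan
```

Because the order is fixed greedily by the scores, some feasible schedule orders can never be produced by a map like this, which is the kind of restricted reachability the abstract attributes to list-scheduling maps.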

51. ❌ MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation

Authors: Yurui Chang, Yiran Wu, Qingyun Wu, Lu Lin Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23234v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

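Scoring rationale: The paper's core contribution is MemCollab, a cross-agent memory collaboration framework for LLM-based agents that builds an agent-agnostic memory by contrasting the reasoning trajectories of different agents on the same task. It is highly relevant to "LLM Agents" and "Multi-agent Systems" (10 points), since it directly studies collaboration among heterogeneous agents. It is relevant to "Chain of Thought" and "System 2 Thinking" (8 points), since it involves reasoning-trajectory analysis and the extraction of abstract reasoning constraints. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.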

!!! tip deepseek-chat TL;DR

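The paper proposes MemCollab, a framework that builds an agent-agnostic shared memory by contrasting the reasoning trajectories of different LLM agents on the same task. Experiments show that it improves the accuracy and inference efficiency of heterogeneous agents on mathematical reasoning and code generation tasks.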

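Abstract (translated)

Agents based on large language models (LLMs) rely on memory mechanisms to reuse knowledge from past problem-solving experience. Existing methods typically build memory per agent, tightly coupling the stored knowledge to a single model's reasoning style. In modern deployments with heterogeneous agents, a natural question arises: can a single memory system be shared across different models? We find that naively transferring memory between agents often degrades performance, because such memory mixes task-relevant knowledge with agent-specific preferences. To address this, we propose MemCollab, a collaborative memory framework that builds agent-agnostic memory by contrasting the reasoning trajectories that different agents generate on the same task. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are used at inference time. Experiments on mathematical reasoning and code generation benchmarks show that MemCollab consistently improves the accuracy and inference efficiency of different agents, including in cross-model-family settings. Our results show that such collaboratively built memory can serve as a shared reasoning resource for diverse LLM-based agents.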

摘要 (Abstract)

Large language model (LLM)-based agents rely on memory mechanisms to reuse knowledge from past problem-solving experiences. Existing approaches typically construct memory in a per-agent manner, tightly coupling stored knowledge to a single model’s reasoning style. In modern deployments with heterogeneous agents, a natural question arises: can a single memory system be shared across different models? We found that naively transferring memory between agents often degrades performance, as such memory entangles task-relevant knowledge with agent-specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are used at inference time. Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-modal-family settings. Our results show that the collaboratively constructed memory can function as a shared reasoning resource for diverse LLM-based agents.

Keywords: LLM-based agents, memory collaboration, contrastive trajectory distillation, agent-agnostic memory, reasoning trajectories, task-aware retrieval, mathematical reasoning, code generation
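
The task-aware retrieval step described in the abstract can be pictured as a memory store keyed by task category, so that only constraints from the matching category are returned. The sketch below is a toy version under assumed names (`SharedMemory`, `add`, `retrieve`); it uses plain word overlap as a stand-in relevance score, since the abstract does not specify how the authors rank constraints.

```python
from collections import defaultdict


class SharedMemory:
    """Toy agent-agnostic memory: reasoning constraints grouped by task category."""

    def __init__(self):
        self._by_category = defaultdict(list)

    def add(self, category: str, constraint: str) -> None:
        """Store one abstract reasoning constraint under a task category."""
        self._by_category[category].append(constraint)

    def retrieve(self, category: str, query: str, k: int = 2) -> list:
        """Return up to k constraints from the query's category only,
        ranked by word overlap with the query (a placeholder relevance score)."""
        query_words = set(query.lower().split())
        candidates = self._by_category.get(category, [])
        ranked = sorted(
            candidates,
            key=lambda c: len(query_words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]
```

Conditioning on the category before ranking means that a coding constraint can never leak into a math prompt, whatever its surface similarity to the query.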

52. ❌ General Machine Learning: Theory for Learning Under Variable Regimes

Authors: Aomar Osmani Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23220v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

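Scoring rationale: The paper studies "regime-varying learning" in machine learning theory, a purely theoretical framework concerned with the core learning-theoretic objects and theorems that arise when the learner, its memory state, and the evaluation conditions evolve over time. Its content is entirely oriented toward mathematics and theoretical computer science, involving abstract concepts such as admissible transport, protected-core preservation, and evaluator-aware learning evolution. All scoring keywords focus on concrete techniques or application areas such as large models, deep learning, and AI applications, whereas this paper is purely theoretical machine learning research with no direct connection to them. The paper does not cover any specific model architecture, training method, inference technique, application domain, or AI system implementation.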

!!! tip deepseek-chat TL;DR

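The paper proposes a structured theoretical framework for regime-varying learning, in which the learner, its memory state, and the evaluation conditions change over time, and establishes its first theorem-supporting layer, including core results on admissibility, a protected-stability template, and evaluator factorization.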

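Abstract (translated)

This paper studies learning under regime variation, where the learner, its memory state, and the evaluation conditions may all evolve over time. It is a foundational and structural contribution: its goal is to define the core learning-theoretic objects for such settings and to establish their first theorem-supporting results.
The paper builds a regime-varying framework centered on admissible transport, protected-core preservation, and evaluator-aware learning evolution. The framework records the immediate closure properties of admissibility, gives a structural obstruction argument against faithful fixed-ontology reduction in genuinely multi-regime settings, and introduces a protected-stability template together with explicit numerical and symbolic witnesses on controlled subclasses, including convex and deductive settings. It also establishes theorem-level results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers.
Through a worked two-regime example, the paper makes the admissibility certificate, the protected evaluative core, and the regime-variation cost explicit on a controlled subclass. The symbolic component is deliberately limited in scope: the paper establishes a first kernel-level compatibility result and provides a controlled monotonic deductive witness. The manuscript should therefore be read as introducing a structured learning-theoretic framework for regime-varying learning together with its first theorem-supporting layer, not as a complete quantitative theory of all learning systems.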

摘要 (Abstract)

We study learning under regime variation, where the learner, its memory state, and the evaluative conditions may evolve over time. This paper is a foundational and structural contribution: its goal is to define the core learning-theoretic objects required for such settings and to establish their first theorem-supporting consequences. The paper develops a regime-varying framework centered on admissible transport, protected-core preservation, and evaluator-aware learning evolution. It records the immediate closure consequences of admissibility, develops a structural obstruction argument for faithful fixed-ontology reduction in genuinely multi-regime settings, and introduces a protected-stability template together with explicit numerical and symbolic witnesses on controlled subclasses, including convex and deductive settings. It also establishes theorem-layer results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers. A worked two-regime example makes the admissibility certificate, protected evaluative core, and regime-variation cost explicit on a controlled subclass. The symbolic component is deliberately restricted in scope: the paper establishes a first kernel-level compatibility result together with a controlled monotonic deductive witness. The manuscript should therefore be read as introducing a structured learning-theoretic framework for regime-varying learning together with its first theorem-supporting layer, not as a complete quantitative theory of all learning systems.

Keywords: regime-varying learning, learning theory, admissible transport, protected-core preservation, evaluator-aware learning, theorem-supporting framework, multi-regime settings, structural obstruction

53. ❌ PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Authors: Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23231v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

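Scoring rationale: The paper's core topic is long-term memory augmentation for large language models (LLMs), in particular building agents (LLM Agents) that adapt to users' evolving needs, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points). The paper proposes the PERMA benchmark for evaluating personalized memory agents and involves retrieval-augmented generation (RAG) techniques for improving dialogue retrieval, which gives it some relevance to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (8 points). The work focuses on long-term memory and cross-session interaction, which implicitly involves extending the context window to handle temporal data, giving moderate relevance to "Context Window Extension OR Long Context LLMs" (5 points). Other keywords such as MoE, SFT, RLHF, and quantization are not mentioned in the abstract and are unrelated to the paper's topic (0 points).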

!!! tip deepseek-chat TL;DR

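The paper proposes PERMA, a benchmark for evaluating personalized agents built on LLMs with long-term memory, using event-driven preferences and realistic task environments. It finds that advanced memory systems extract more precise preferences and reduce token consumption by linking related interactions, but still struggle to maintain a coherent persona across temporal depth and cross-domain interference.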

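Abstract (translated)

Giving large language models long-term memory is essential for building agents that adapt to users' changing needs. However, existing evaluations typically interleave preference-related dialogues with irrelevant ones, which in effect reduces the task to needle-in-a-haystack retrieval and ignores the relationships between events that drive the evolution of user preferences. Such settings overlook a basic characteristic of real-world personalization: preferences emerge gradually and accumulate in noisy interaction contexts. To close this gap, we propose the PERMA benchmark, designed to evaluate an agent's ability to maintain persona consistency over time rather than only static preference recall. In addition, we introduce (1) text variability and (2) linguistic alignment to simulate the erratic user inputs and individual idiolects found in real data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries interspersed over time. We design multiple-choice and interactive tasks to probe the model's understanding of the persona along the interaction timeline. Experiments show that, by linking related interactions, advanced memory systems extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval over raw dialogues. These systems nevertheless still struggle to maintain a coherent persona across temporal depth and cross-domain interference, which highlights the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.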

摘要 (Abstract)

Empowering large language models with long-term memory is crucial for building agents that adapt to users’ evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model’s understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

Keywords: Large Language Models, LLM Agents, Personalized Memory, Benchmark, Event-Driven Preference, Long-term Memory, Retrieval-Augmented Generation, Persona Consistency

54. ❌ ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

Authors: Hao Wang, Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23184v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

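Scoring rationale: The paper focuses on reward modeling in RLHF and proposes ImplicitRM, a method for learning unbiased reward models from implicit preference data. The core relevant keywords are: (1) "Large Language Models" (the paper studies LLM alignment); (2) "Instruction Tuning OR Alignment OR Value Alignment" (the paper directly studies LLM alignment); (3) "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (the paper explicitly targets the reward modeling challenge in RLHF). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not covered; the paper does not discuss domain-specific applications (such as bioinformatics), nor does it involve reasoning, agents, compression, or similar techniques.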

!!! tip deepseek-chat TL;DR

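The paper addresses the reliance of RLHF reward modeling on costly explicit feedback data. It proposes ImplicitRM, a method for learning unbiased reward models from implicit preference data, and validates its effectiveness experimentally.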

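Abstract (translated)

Reward modeling is a long-standing challenge in reinforcement learning from human feedback for aligning language models. Current reward modeling relies heavily on experimental feedback data, which is expensive to collect. In this work, we study implicit reward modeling, that is, learning reward models from implicit human feedback such as clicks and copies, as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) implicit preference data suffers from user preference bias, meaning that different responses have different propensities to elicit user feedback actions, which makes it harder to identify definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM divides training samples into four latent groups via a stratification model and, on that basis, derives a learning objective through likelihood maximization. We prove that this objective is theoretically unbiased and effectively resolves both challenges. Experiments show that ImplicitRM learns accurate reward models on a range of implicit preference datasets. Code is available on the project website.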

摘要 (Abstract)

Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study implicit reward modeling – learning reward models from implicit human feedback (e.g., clicks and copies) – as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.

Keywords: Reward Modeling, RLHF, Implicit Preference, Alignment, Unbiased Learning, Human Feedback, Language Models, Stratification Model

55. ❌ Reasoning over Semantic IDs Enhances Generative Recommendation

Authors: Yingzhi He, Yan Sun, Junfei Tan, Yuxin Chen, Xiaoyu Kong, Chunxu Shen, Xiang Wang, An Zhang, Tat-Seng Chua Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23183v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

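Scoring rationale: The paper studies LLMs in recommender systems, specifically generative recommendation via Semantic IDs (SIDs), and focuses on reasoning over SIDs. Highly relevant keywords: LLMs (the core foundation), Chain of Thought/Reasoning (the core contribution), System 2 Thinking (in-depth reasoning), and Alignment (SID-language alignment). Moderately relevant keywords: Pre-training/Domain Adaptation (multi-task training), SFT (optimization), and Explainable AI (interpretability is mentioned). Other keywords such as MoE, SLMs, RAG, and Quantization are unrelated to the paper.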
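
The pass mark in this report follows the configured formula 5.0 + 0.8 × total weight. The sketch below works through that arithmetic. The per-row rule (weight × relevance, capped at 15 per keyword) and all function names are assumptions for illustration, inferred from the table columns rather than taken from the report generator.

```python
# Hypothetical helpers; names and the row-scoring rule are illustrative assumptions.
MAX_SCORE_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as configured in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """Assumed row rule: weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def paper_score(rows):
    """Sum the row scores for a list of (weight, relevance) pairs."""
    return sum(row_score(w, r) for w, r in rows)


# A few rows shaped like the table entries: weight 0.0, so relevance has no effect.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 8.0)]
total = paper_score(rows)
threshold = pass_mark(27.0)  # 27 keywords, each configured with weight 1.0
print(round(threshold, 1), total, total >= threshold)
```

Under this assumed rule, a weight of 0.0 in every row forces the total to 0.0 regardless of relevance, which matches the scores shown for the papers in this part of the report.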

!!! tip deepseek-chat TL;DR

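The paper proposes SIDReasoner, a framework that strengthens the alignment between Semantic IDs and language to unlock transferable LLM reasoning in generative recommendation. It addresses the challenges of reasoning over SIDs, and experiments show gains in accuracy, interpretability, and cross-domain generalization.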

Abstract (Chinese translation)

生成式推荐领域的最新进展通过将序列推荐任务构建为在统一标记空间上的自回归生成过程，有效利用了预训练大语言模型。该统一空间包含语言标记与项目标识符，其中每个项目由一组紧凑的离散标记序列表示，即语义标识符。这种基于语义标识符的框架能够支持在大规模项目库上进行高效解码，并为基于大语言模型的推荐系统提供了利用丰富世界知识的自然接口。与此同时，大语言模型推理能力的突破推动了推理增强型推荐的发展，然而针对语义标识符的有效推理仍研究不足且面临挑战。项目标记本身对大语言模型缺乏固有语义含义；此外，面向推荐的语义标识符推理难以评估，导致高质量监督数据稀缺。
为应对这些挑战，我们提出SIDReasoner——一个两阶段框架，通过强化语义标识符与语言的对齐来激发对语义标识符的推理能力，从而释放大语言模型的可迁移推理潜力，而非依赖大量推荐专用的推理轨迹。具体而言，SIDReasoner首先通过多任务训练增强语义标识符-语言对齐，该训练使用由更强教师模型合成的、以语义标识符为中心的增强语料库，将项目标记锚定于多样化的语义与行为上下文中。基于这种增强的对齐能力，SIDReasoner进一步通过结果驱动的强化优化提升推荐推理性能，该方法能引导模型走向有效的推理轨迹，而无需显式的推理标注。在三个真实世界数据集上的大量实验验证了我们基于语义标识符的推理增强生成式推荐方法的有效性。除准确性提升外，实验结果更凸显了大推理模型在生成式推荐中的广阔潜力，包括改进的可解释性与跨领域泛化能力。

Abstract

Recent advances in generative recommendation have leveraged pretrained LLMs by formulating sequential recommendation as autoregressive generation over a unified token space comprising language tokens and itemic identifiers, where each item is represented by a compact sequence of discrete tokens, namely Semantic IDs (SIDs). This SID-based formulation enables efficient decoding over large-scale item corpora and provides a natural interface for LLM-based recommenders to leverage rich world knowledge. Meanwhile, breakthroughs in LLM reasoning motivate reasoning-enhanced recommendation, yet effective reasoning over SIDs remains underexplored and challenging. Itemic tokens are not natively meaningful to LLMs; moreover, recommendation-oriented SID reasoning is hard to evaluate, making high-quality supervision scarce. To address these challenges, we propose SIDReasoner, a two-stage framework that elicits reasoning over SIDs by strengthening SID–language alignment to unlock transferable LLM reasoning, rather than relying on large amounts of recommendation-specific reasoning traces. Concretely, SIDReasoner first enhances SID-language alignment via multi-task training on an enriched SID-centered corpus synthesized by a stronger teacher model, grounding itemic tokens in diverse semantic and behavioral contexts. Building on this enhanced alignment, SIDReasoner further improves recommendation reasoning through outcome-driven reinforced optimization, which guides the model toward effective reasoning trajectories without requiring explicit reasoning annotations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our reasoning-augmented SID-based generative recommendation. Beyond accuracy, the results highlight the broader potential of large reasoning models for generative recommendation, including improved interpretability and cross-domain generalization.

Keywords: Generative Recommendation, Semantic IDs, LLM Reasoning, SID-Language Alignment, Reasoning-Augmented Recommendation, Autoregressive Generation, Outcome-Driven Optimization, Cross-Domain Generalization

56. ❌ SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense

Authors: Bibek Das, Chandranath Adak, Soumi Chattopadhyay, Zahid Akhtar, Soumya Dutta Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

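Scoring rationale: SAiW concerns digital watermarking for deepfake defense: a source-attributable invisible watermarking framework for proactive defense and media provenance verification. All scoring keywords revolve around large models, the technical principles of deep learning, and their applications (such as AI for Science), whereas this paper studies defensive watermarking against generative models (deepfakes) and belongs to media security. It does not touch on large-model principles, training methods, inference optimization, alignment, agent systems, or scientific AI applications, so it is unrelated to every keyword and scores 0 on all of them.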

!!! tip deepseek-chat TL;DR

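The paper proposes SAiW, a source-attributable invisible watermarking framework for proactive deepfake defense and media provenance verification. By formulating watermark embedding as a source-conditioned representation learning problem, it maintains high perceptual quality while remaining strongly robust to compression, noise, and adversarial attacks.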

Abstract (Chinese translation)

现代生成模型生成的深度伪造内容对信息完整性、数字身份和公共信任构成严重威胁。现有检测方法大多属于被动反应型，试图在篡改发生后进行识别，且往往难以适应不断演进的生成技术。这促使我们需要在媒体创建时即确保其真实性的主动防护机制。本研究提出SAiW（源属性隐形水印框架），一种用于主动式深度伪造防御与媒体溯源的框架。与传统将水印载荷视为通用信号的水印方法不同，SAiW将水印嵌入建模为源条件表征学习问题：水印身份编码记录生成来源，并通过调制嵌入过程产生可区分、可追踪的特征标识。该框架集成特征级线性调制技术，将源身份注入嵌入网络，实现可扩展的多源水印生成。基于人类视觉系统先验知识构建的感知引导模块，确保水印扰动在保持鲁棒性的同时维持视觉不可感知性。此外，双功能取证解码器可同步重构嵌入水印并执行源归属判定，同时提供自动化验证与可解释的取证证据。在多个深度伪造数据集上的大量实验表明，SAiW在保持高感知质量的同时，对压缩、滤波、噪声、几何变换及对抗性扰动均表现出强鲁棒性。通过不可见但可验证的标记将数字媒体与其来源绑定，SAiW实现了可靠的身份验证与源归属追踪，为主动式深度伪造防御与可信媒体溯源提供了可扩展的基础架构。

Abstract

Deepfakes generated by modern generative models pose a serious threat to information integrity, digital identity, and public trust. Existing detection methods are largely reactive, attempting to identify manipulations after they occur and often failing to generalize across evolving generation techniques. This motivates the need for proactive mechanisms that secure media authenticity at the time of creation. In this work, we introduce SAiW, a Source-Attributed Invisible watermarking Framework for proactive deepfake defense and media provenance verification. Unlike conventional watermarking methods that treat watermark payloads as generic signals, SAiW formulates watermark embedding as a source-conditioned representation learning problem, where watermark identity encodes the originating source and modulates the embedding process to produce discriminative and traceable signatures. The framework integrates feature-wise linear modulation to inject source identity into the embedding network, enabling scalable multi-source watermark generation. A perceptual guidance module derived from human visual system priors ensures that watermark perturbations remain visually imperceptible while maintaining robustness. In addition, a dual-purpose forensic decoder simultaneously reconstructs the embedded watermark and performs source attribution, providing both automated verification and interpretable forensic evidence. Extensive experiments across multiple deepfake datasets demonstrate that SAiW achieves high perceptual quality while maintaining strong robustness against compression, filtering, noise, geometric transformations, and adversarial perturbations. By binding digital media to its origin through invisible yet verifiable markers, SAiW enables reliable authentication and source attribution, providing a scalable foundation for proactive deepfake defense and trustworthy media provenance.
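
The abstract mentions feature-wise linear modulation (FiLM) as the way source identity is injected into the embedding network. As a rough illustration of the general FiLM operation only, and not of SAiW's actual architecture, the sketch below scales and shifts each feature channel with per-source parameters. The lookup table, the source names, and all values are hypothetical; in practice the scale and shift would be produced by a learned network.

```python
# Minimal FiLM sketch in plain Python. The source table and its values are
# hypothetical; a real system would predict gamma and beta from a source embedding.
FILM_PARAMS = {
    "source_a": {"gamma": [1.0, 0.5], "beta": [0.0, 0.1]},
    "source_b": {"gamma": [0.8, 1.2], "beta": [0.2, -0.1]},
}


def film(features, gamma, beta):
    """Apply feature-wise linear modulation: out[c] = gamma[c] * x[c] + beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def modulate_for_source(features, source_id):
    """Look up the per-source scale and shift, then modulate the features."""
    params = FILM_PARAMS[source_id]
    return film(features, params["gamma"], params["beta"])


features = [2.0, 4.0]  # one feature vector with two channels
print(modulate_for_source(features, "source_a"))
print(modulate_for_source(features, "source_b"))
```

The same input features yield different modulated outputs per source, which is the property that would let a decoder tell sources apart.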

Keywords: invisible watermarking, deepfake defense, source attribution, media provenance, representation learning, perceptual guidance, forensic decoder, robustness

57. ❌ Robust Safety Monitoring of Language Models via Activation Watermarking

Authors: Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

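Scoring rationale: The paper focuses on safety monitoring for LLMs; its core is detecting and defending against adversarial attacks to prevent the disclosure of sensitive information. It is highly relevant only to "Large Language Models" (10 points), since it directly studies LLM safety monitoring mechanisms. The other keywords cover model architecture, training methods, inference optimization, application domains, and so on, none of which the paper addresses, so they score 0.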

!!! tip deepseek-chat TL;DR

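Addressing the risk that large language models (LLMs) are misused at inference time to reveal sensitive information, the paper proposes a robust monitoring method based on activation watermarking that detects and defends against malicious queries from adaptive attackers.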

Abstract (Chinese translation)

大型语言模型（LLM）可能被滥用于泄露敏感信息，例如武器制造指南或编写恶意软件。LLM提供商依赖监控机制，在推理过程中检测并标记不安全行为。当前一个开放的安全挑战是自适应攻击者，他们能够构建同时满足以下条件的攻击：（i）逃避检测，同时（ii）引发不安全行为。自适应攻击者是一个主要威胁，因为LLM提供商无法修补其安全机制，他们并不清楚模型如何被滥用。我们将鲁棒的LLM监控建模为一个安全博弈：了解监控机制的攻击者试图提取敏感信息，而提供商必须在低误报率下准确检测这些对抗性查询。我们的工作（i）表明现有LLM监控机制在面对自适应攻击者时存在脆弱性，并且（ii）通过激活水印技术设计了改进的防御方案，即在推理过程中为攻击者谨慎引入不确定性。研究发现，在知晓监控算法但不知晓密钥的自适应攻击者面前，激活水印方法的性能优于基线防护方案，优势最高可达52%。

Abstract

Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on monitoring to detect and flag unsafe behavior during inference. An open security challenge is adaptive adversaries who craft attacks that simultaneously (i) evade detection while (ii) eliciting unsafe behavior. Adaptive attackers are a major concern as LLM providers cannot patch their security mechanisms, since they are unaware of how their models are being misused. We cast robust LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates. Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through activation watermarking by carefully introducing uncertainty for the attacker during inference. We find that activation watermarking outperforms guard baselines by up to 52% under adaptive attackers who know the monitoring algorithm but not the secret key.

Keywords: Large Language Models, LLM monitoring, safety monitoring, activation watermarking, adaptive attackers, robust detection, security game, sensitive information

58. ❌ Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Authors: Massimiliano Pappa, Luca Romani, Valentino Sacco, Alessio Palma, Stéphane Lathuilière, Fabio Galasso, Xavier Alameda-Pineda, Indro Spinelli Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

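Scoring rationale: The paper's core contribution is DILLO (DIstiLLed Language-ActiOn World Model), an LLM-based world model for proactively steering safety-critical agents. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), because an LLM is DILLO's core component; to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), because the work centers on agent control; and to "World Models AND General World Models" (10 points), because DILLO is itself a world model. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science are not mentioned in the abstract and score 0.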

!!! tip deepseek-chat TL;DR

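Targeting the high latency of visual simulation when deploying safety-critical agents, the paper proposes DILLO, a distilled language-action world model built on an LLM. It predicts action outcomes through text-only inference, achieving a 14x speedup and a significant improvement in task success rate.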

Abstract (Chinese translation)

部署安全关键型智能体需要在执行动作前预判其后果。虽然世界模型为这种前瞻性预测提供了范式，但当前依赖视觉模拟的方法会产生极高的延迟，通常每步操作超过数秒。本研究挑战了“视觉处理对于故障预防是必要的”这一假设。我们证明，经过训练的策略的潜在状态与其规划动作相结合，已编码了足够的信息来预测动作结果，使得视觉模拟在故障预防中变得冗余。为此，我们提出了DILLO（蒸馏语言-动作世界模型），这是一种快速转向层，将范式从“先模拟后行动”转变为“先描述后行动”。DILLO通过跨模态蒸馏进行训练：一个具备特权视觉语言模型（Vision Language Model）教师对离线轨迹进行标注，一个潜在状态条件化的大语言模型（Large Language Model）学生则学习预测语义结果。这创建了一条纯文本推理路径，完全绕过了繁重的视觉生成过程，相比基线方法实现了14倍的加速。在MetaWorld和LIBERO平台上的实验表明，DILLO能生成高保真度的下一状态描述，并能有效引导策略，将任务平均成功率最高提升15个百分点，平均提升9.3个百分点。

Abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy’s latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from “simulate-then-act” to “describe-then-act.” DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

Keywords: World Models, Large Language Models, Agent Steering, Cross-modal Distillation, Proactive Foresight, Safety-critical Agents, Inference Acceleration, Semantic Outcome Prediction

59. ❌ Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

Authors: Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, Marisa Llorens Salvador Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

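Scoring rationale: The paper's core topic is AI-generated text detection, which directly involves LLMs (a weight-1.0 keyword) and Explainable AI (a weight-1.0 keyword), each scored 10; other keywords such as MoE, SLMs, and Scaling Laws are neither mentioned in nor relevant to the abstract and score 0.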

!!! tip deepseek-chat TL;DR

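The paper studies why AI-generated text detectors fail to generalize in real-world settings. Using explainable-AI methods, it finds that existing detectors over-rely on dataset-specific features rather than stable signals of machine authorship, and it proposes an open-source detection framework.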

Abstract (Chinese translation)

大型语言模型（LLM）的广泛采用使得AI生成文本的检测成为一个紧迫而复杂的挑战。尽管许多检测系统在基准测试中报告了较高的准确率，但其在真实场景中的可靠性仍不确定，且可解释性往往未被充分探索。本研究旨在探究当代检测器是否真正识别了机器生成特征，还是仅仅利用了数据集特定的伪影。我们提出了一种可解释的检测框架，该框架整合了语言学特征工程、机器学习与可解释人工智能技术。在两个重要的基准语料库（即PAN CLEF 2025和COLING 2025）上进行评估时，我们基于30个语言学特征训练的模型取得了与排行榜相竞争的性能，F1分数达到0.9734。然而，系统的跨领域和跨生成器评估揭示了严重的泛化失败：在领域内表现优异的分类器在分布变化下性能显著下降。通过基于SHAP的解释分析，我们发现最具影响力的特征在不同数据集间存在显著差异，这表明检测器往往依赖于数据集特定的风格线索，而非机器生成文本的稳定信号。进一步的深入错误分析揭示了基于语言学特征的AI文本检测中存在一个根本性矛盾：在领域内数据上最具区分性的特征，恰恰也是最易受领域偏移、格式变化和文本长度效应影响的特征。我们相信这一认知有助于构建在不同场景下均具有鲁棒性的AI检测器。为支持复现与实际应用，我们发布了一个开源Python工具包，该工具包能够为单个文本返回预测结果及实例级别的解释。

Abstract

The widespread adoption of Large Language Models (LLMs) has made the detection of AI-Generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.

Keywords: AI-Generated Text Detection, Large Language Models, Explainable AI, Generalization Failure, Linguistic Features, SHAP Explanations, Cross-domain Evaluation, Open-source Package

60. ❌ Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment

Authors: Adrian Sauter, Mona Schirmer Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23114v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

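Scoring rationale: The paper studies the context sensitivity of LLM moral judgment, which directly involves LLM evaluation and moral alignment (Value Alignment), so these two keywords score 10. Other keywords such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, agent systems, compression and quantization, and scientific AI are neither mentioned in nor relevant to the abstract and score 0.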

!!! tip deepseek-chat TL;DR

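The study shows that LLMs are markedly context-sensitive in their moral judgments, shifting toward rule-violating behavior, and that their response to contextual variation differs from that of humans. It also uses activation steering to controllably adjust a model's contextual sensitivity.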

Abstract (Chinese translation)

人类的道德决策在很大程度上依赖于具体情境。然而，现有关于大语言模型道德判断的研究大多局限于固定场景。为填补这一空白，我们引入了“情境化道德选择”数据集，该数据集包含一系列道德困境，并系统性地融入了道德心理学中已知能改变人类判断的三种情境变量：结果主义考量、情感因素与关系背景。通过对22个大语言模型的评估，我们发现几乎所有模型都表现出情境敏感性，其判断会倾向于违背规则的行为。与人类调查数据对比后，我们发现模型与人类最易受不同情境变量的影响，且一个在基准情境中与人类判断对齐的模型，其情境敏感性未必与人类保持一致。这引发了如何控制模型情境敏感性的问题，我们通过激活引导技术对此进行了探索，该技术能够可靠地增强或减弱模型的情境敏感性。

Abstract

A human’s moral decision depends heavily on the context. Yet research on LLM morality has largely studied fixed scenarios. We address this gap by introducing Contextual MoralChoice, a dataset of moral dilemmas with systematic contextual variations known from moral psychology to shift human judgment: consequentialist, emotional, and relational. Evaluating 22 LLMs, we find that nearly all models are context-sensitive, shifting their judgments toward rule-violating behavior. Comparing with a human survey, we find that models and humans are most triggered by different contextual variations, and that a model aligned with human judgments in the base case is not necessarily aligned in its contextual sensitivity. This raises the question of controlling contextual sensitivity, which we address with an activation steering approach that can reliably increase or decrease a model’s contextual sensitivity.
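
The abstract does not say how its activation steering is implemented. A common generic form, shown here only as an illustrative sketch, adds a scaled direction vector to a hidden activation. The function name, the toy vectors, and the scaling values are assumptions; how the direction is obtained and where it is applied in the model are outside this sketch.

```python
def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction, computed element by element.

    A positive alpha pushes the activation along the direction and a negative
    alpha pushes against it. All inputs here are toy values.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


hidden = [1.0, 2.0]       # hypothetical hidden activation
direction = [0.5, -0.5]   # hypothetical "context sensitivity" direction
print(steer(hidden, direction, 2.0))   # pushed along the direction
print(steer(hidden, direction, -2.0))  # pushed against the direction
```

Flipping the sign of `alpha` is what makes the adjustment two-way, in line with the abstract's description of increasing or decreasing contextual sensitivity.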

Keywords: LLM moral judgment, context sensitivity, moral dilemmas, alignment, activation steering, human comparison, rule-violating behavior

61. ❌ Can an LLM Detect Instances of Microservice Infrastructure Patterns?

Authors: Carlos Eduardo Duarte, Neil B. Harrison, Filipe Figueiredo Correia, Ademar Aguiar, Pavlína Gonçalves Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

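Scoring rationale: The paper studies the use of GPT-5 nano (an LLM) to detect microservice architectural patterns, an application of LLMs to software engineering. Because it centers on evaluating an LLM's applied capability, it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). Other keywords such as MoE, SLMs, training methods, inference optimization, agent systems, and scientific AI are neither addressed nor mentioned in the paper and score 0.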

!!! tip deepseek-chat TL;DR

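The paper examines how well an LLM (GPT-5 nano) detects microservice architectural patterns across programming languages. Evaluated on a newly built human-annotated dataset, detection performance varies with pattern prevalence and artifact distinctiveness, and patterns tied to recognizable, dominant artifacts are detected more reliably.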

Abstract (Chinese translation)

架构模式在各种软件制品中普遍存在。模式的多样性及其实现方式的差异使得现有工具的检测面临挑战，尤其因为现有工具通常仅支持检测单一语言编写的制品。大型语言模型（LLMs）基于广泛的软件制品和知识进行训练，可能克服现有方法的局限性。然而，其实际效能及影响性能的因素尚未得到深入探究。为了更好地理解这一点，我们开发了MicroPAD。该工具利用GPT 5 nano，基于自然语言模式描述，识别以任何语言编写的软件制品中的架构模式。我们使用MicroPAD评估了LLM检测架构模式实例的能力，特别是与基础设施相关的微服务模式。为此，我们选取了一系列GitHub代码库，并联系其核心贡献者，创建了一个包含190个含有微服务架构模式代码库的新的人工标注数据集。结果表明，MicroPAD能够跨多种语言和制品类型检测模式实例。检测性能因模式而异（F1分数范围从0.09到0.70），具体与其普遍性以及模式所呈现制品的独特性相关。我们还发现，与可识别、占主导地位的制品相关联的模式能被更可靠地检测。这些发现是否适用于其他LLMs和工具，是未来研究的一个有前景的方向。

Abstract

Architectural patterns are frequently found in various software artifacts. The wide variety of patterns and their implementations makes detection challenging with current tools, especially since they often only support detecting patterns in artifacts written in a single language. Large Language Models (LLMs), trained on a diverse range of software artifacts and knowledge, might overcome the limitations of existing approaches. However, their true effectiveness and the factors influencing their performance have not yet been thoroughly examined. To better understand this, we developed MicroPAD. This tool utilizes GPT 5 nano to identify architectural patterns in software artifacts written in any language, based on natural-language pattern descriptions. We used MicroPAD to evaluate an LLM’s ability to detect instances of architectural patterns, particularly infrastructure-related microservice patterns. To accomplish this, we selected a set of GitHub repositories and contacted their top contributors to create a new, human-annotated dataset of 190 repositories containing microservice architectural patterns. The results show that MicroPAD was capable of detecting pattern instances across multiple languages and artifact types. The detection performance varied across patterns (F1 scores ranging from 0.09 to 0.70), specifically in relation to their prevalence and the distinctiveness of the artifacts through which they manifest. We also found that patterns associated with recognizable, dominant artifacts were detected more reliably. Whether these findings generalize to other LLMs and tools is a promising direction for future research.
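
The abstract reports per-pattern F1 scores between 0.09 and 0.70. For reference, the sketch below computes precision, recall, and F1 from detection counts. The function name and the counts are made up for illustration and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single pattern.
p, r, f1 = precision_recall_f1(tp=7, fp=3, fn=3)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it falls sharply when either precision or recall is low, and a score as low as 0.09 implies that at least one of the two is close to zero for that pattern.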

Keywords: Large Language Models, LLM, architectural patterns, microservice patterns, software artifacts, pattern detection, GPT-5 nano, multi-language detection

62. ❌ MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

Authors: Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23085v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

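Scoring rationale: The paper focuses on improving causal reasoning in medical vision-language models (VLMs) and is highly relevant to several keywords: (1) it directly extends Chain of Thought (CoT) reasoning to the medical domain (10 points); (2) it achieves self-correction through a self-reflection mechanism (10 points); (3) its core goal is to reduce hallucination and improve factuality (10 points); (4) it is an application of AI for Science in biomedicine (10 points); (5) it involves in-depth reasoning and explainable AI (8 points each); (6) although LLMs are not mentioned explicitly, VLMs extend large models to the vision-language multimodal setting, so that keyword receives 8 points. Other keywords such as MoE, quantization, and RAG are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

Medical vision-language models lack causal reasoning mechanisms, which leaves them open to spurious correlations and limits their clinical reliability. The paper proposes the MedCausalX framework, which uses an adaptive reflection architecture and a causal correction objective to improve diagnostic consistency by 5.4 points and reduce hallucination by over 10 points.

Abstract (Chinese translation)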

视觉-语言模型通过整合视觉感知与语言推理，实现了可解释的医疗诊断。然而，现有的医学思维链模型缺乏显式的因果推理表征与执行机制，使其易受虚假关联影响，限制了临床可靠性。我们指出医学思维链推理中的三个核心挑战：如何自适应触发因果修正、构建高质量的因果-虚假对比样本，以及在推理轨迹间保持因果一致性。为解决这些挑战，我们提出MedCausalX——一个在医学视觉-语言模型中显式建模因果推理链的端到端框架。我们首先构建了CRMed数据集，该数据集提供细粒度解剖标注、结构化因果推理链以及反事实变体，用于引导模型学习超越表层关联的因果关系。基于CRMed，MedCausalX采用配备〈因果〉与〈验证〉标记的两阶段自适应反思架构，使模型能自主决定何时及如何进行因果分析与验证。最后，通过基于错误归因的强化学习优化的轨迹级因果修正目标精炼推理链，使模型能够区分真实的因果依赖与捷径关联。在多个基准测试上的广泛实验表明，MedCausalX持续优于现有最优方法，将诊断一致性提升5.4个百分点，幻觉生成率降低超过10个百分点，并获得最高的空间定位交并比，从而为基于因果关系的医学推理设立了新标准。

Abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset providing fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.

Keywords: Medical Vision-Language Models, Causal Reasoning, Self-Reflection, Chain-of-Thought, Hallucination Mitigation, Medical Diagnosis, Error-Attributed Reinforcement Learning, CRMed Dataset
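
The abstract reports spatial grounding quality as IoU (intersection over union). For readers unfamiliar with the metric, here is a minimal sketch, assuming axis-aligned bounding boxes in `(x_min, y_min, x_max, y_max)` form. The abstract does not say whether MedCausalX scores boxes or masks, and the `box_iou` name is illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap; clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)   # hypothetical predicted region
annotated = (1.0, 1.0, 3.0, 3.0)   # hypothetical ground-truth region
print(box_iou(predicted, annotated))
```

Here the overlap has area 1 and the union has area 7, so the two boxes share one seventh of their combined area.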

63. ❌ AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing

Authors: Sarubi Thillainathan, Ji-Ung Lee, Michael Sullivan, Alexander Koller Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

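Scoring rationale: The paper proposes an authorship style transfer framework based on LoRA adapters. It is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because the core method trains style-specific LoRA adapters and mixes them layer by layer. It is related to "Large Language Models OR LLMs OR Foundation Models" (8 points), because style transfer is typically built on large language models and the paper compares against GPT-5.1. It is related to "Post-training OR Supervised Fine-tuning OR SFT" (8 points), because training LoRA adapters is a form of fine-tuning. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes AuthorMix, a framework that trains style-specific LoRA adapters and mixes them layer by layer to achieve lightweight, modular authorship style transfer. On low-resource targets it outperforms existing methods and GPT-5.1 and substantially improves meaning preservation.

Abstract (Chinese translation)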

作者风格迁移任务旨在将文本以目标作者的风格进行重写，同时保留原文的语义。现有的风格迁移方法通常在大规模语料上训练单一模型，以同时建模所有目标风格：这种高成本方法在针对特定目标的适应性上灵活性有限，且常常为风格迁移而牺牲语义保持。本文提出AuthorMix：一种轻量化、模块化且可解释的风格迁移框架。我们在少量高资源作者数据上训练独立的、风格特定的LoRA适配器，通过已习得的层级适配器混合策略，仅需少量目标风格训练样本，即可为每个新目标快速训练出专用的适配模型。对于低资源目标，AuthorMix在整体评分上超越了现有的先进风格迁移基线模型以及GPT-5.1，显著提升了语义保持能力。

Abstract

The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target style training examples. AuthorMix outperforms existing, SoTA style-transfer baselines – as well as GPT-5.1 – for low-resource targets, achieving the highest overall score and substantially improving meaning preservation.

Keywords: authorship style transfer, LoRA adapters, layer-wise adapter mixing, parameter-efficient fine-tuning, low-resource adaptation, meaning preservation, modular framework, style-specific adaptation
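
The layer-wise adapter mixing described in the abstract can be pictured as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes each source-author adapter contributes a low-rank update `B @ A` at every layer, and that the mixing weights for a layer come from a softmax over learned logits. The abstract only says the mixing is learned and layer-wise; the softmax, the function names, and the toy matrices are hypothetical.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mixed_update(adapters, logits):
    """Weight update for ONE layer: sum_k alpha_k * (B_k @ A_k).

    adapters: list of (B, A) pairs, one per source author (assumed layout).
    logits:   one learnable mixing logit per source author for this layer.
    """
    alphas = softmax(logits)
    rows, cols = len(adapters[0][0]), len(adapters[0][1][0])
    delta = [[0.0] * cols for _ in range(rows)]
    for alpha, (b, a) in zip(alphas, adapters):
        update = matmul(b, a)
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += alpha * update[i][j]
    return delta


# Two rank-1 toy adapters for a 2x2 weight matrix.
author_1 = ([[1.0], [0.0]], [[1.0, 0.0]])
author_2 = ([[0.0], [1.0]], [[0.0, 1.0]])

# Each layer has its own logits, so the blend can differ by depth.
layer_logits = {0: [0.0, 0.0], 1: [5.0, -5.0]}
deltas = {layer: mixed_update([author_1, author_2], logits)
          for layer, logits in layer_logits.items()}
```

In this toy setup layer 0 blends both authors equally while layer 1 leans almost entirely on the first author. Only the per-layer logits would need training for a new target, which is one way the "handful of target style training examples" in the abstract could suffice.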

64. ❌ Machine Learning Models for the Early Detection of Burnout in Software Engineering: a Systematic Literature Review

Authors: Tien Rahayu Tulili, Ayushi Rastogi, Andrea Capiluppi Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23063v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

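Scoring rationale: The paper is a systematic literature review of machine learning models for the early detection of burnout in software engineering. It mainly covers conventional machine learning methods (such as emotion detection) in a specific application area and does not touch on large models, the principles of deep learning, model training or optimization, inference acceleration, alignment, agent systems, or any other keyword topic, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

Through a systematic literature review, the paper assesses the accuracy and performance of machine learning models for the early detection of burnout in software engineers, finds that most studies focus on emotion detection, and identifies the better-performing tools and datasets.

Abstract (Chinese translation)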

职业倦怠是一种职业综合征，与许多其他行业类似，它影响着大多数软件工程师。既往研究表明了若干重要趋势，其中机器学习技术的应用日益增多，以实现对倦怠的早期识别。
本文针对那些提出机器学习方法、并专注于检测软件开发人员及IT专业人员职业倦怠的研究论文，进行了系统性文献综述。我们的目标是评估所提出机器学习技术的准确性与精确度，并为有意复现或拓展这些研究的未来学者提出建议。
通过本次系统性综述，我们观察到大多数原始研究侧重于情绪检测，或利用情绪维度来识别或预测倦怠的存在。我们还开展了一项横断面研究，以检测哪种机器学习方法在情绪识别方面表现更优，以及哪种数据集在捕捉情绪方面更具潜力和表现力。
我们相信，通过明确哪些机器学习工具和数据集在情绪检测（进而间接识别职业倦怠）方面表现更佳，本文能为推进这一重要研究方向提供有价值的参考。

Abstract

Burnout is an occupational syndrome that, as in many other professions, affects the majority of software engineers. Past research studies showed important trends, including an increasing use of machine learning techniques to allow for an early detection of burnout. This paper is a systematic literature review (SLR) of the research papers that proposed machine learning (ML) approaches, and focused on detecting burnout in software developers and IT professionals. Our objective is to review the accuracy and precision of the proposed ML techniques, and to formulate recommendations for future researchers interested in replicating or extending those studies. From our SLR we observed that a majority of primary studies focus on detecting emotions or utilise emotional dimensions to detect or predict the presence of burnout. We also performed a cross-sectional study to detect which ML approach shows a better performance at detecting emotions, and which dataset has more potential and expressivity to capture emotions. We believe that, by identifying which ML tools and datasets show a better performance at detecting emotions, and indirectly at identifying burnout, our paper can be a valuable asset for progress in this important research direction.

Keywords: machine learning, burnout detection, software engineering, systematic literature review, emotion detection, IT professionals, ML techniques, dataset performance

65. ❌ Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution

Authors: Yechao Zhang, Shiqian Zhao, Jie Zhang, Gelei Deng, Jiawen Zhang, Xiaogeng Liu, Chaowei Xiao, Tianwei Zhang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

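Scoring rationale: The paper studies a security vulnerability in Claw personal AI agents, namely memory pollution during heartbeat-driven background execution. It is highly relevant only to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), because its core concern is agent architecture, memory, and behavioral influence. The other keywords concern the technical principles of large models, training methods, inference optimization, scientific applications, and so on; the paper does not address these, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper reveals a memory-pollution vulnerability in Claw personal AI agents caused by heartbeat-driven background execution: ordinary social misinformation can pollute the agent's memory and influence its behavior without the user's awareness, and no prompt injection is required.

Abstract (Chinese translation)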

我们在主流Claw个人AI智能体中发现了一个关键安全漏洞：在心跳驱动的后台执行过程中遇到的不受信任内容，可能悄无声息地污染智能体记忆，进而在用户无感知的情况下影响其面向用户的行为。该漏洞源于Claw生态系统普遍采用的一种架构设计：心跳后台执行与面向用户的对话运行于同一会话中，因此从后台监控的任何外部来源（包括电子邮件、消息频道、新闻推送、代码仓库及社交平台）摄取的内容，均可进入用于前台交互的同一记忆上下文，而这一过程通常用户可见性有限且缺乏清晰来源追溯。我们将此过程形式化为一个暴露（E）→记忆（M）→行为（B）路径：心跳执行期间遇到的错误信息进入智能体的短期会话上下文，可能被写入长期记忆，并最终塑造后续面向用户的行为。我们使用Moltbook的受控研究复现系统MissClaw，在智能体原生的社交环境中实例化了该路径。研究发现：（1）社交可信度线索（尤其是感知共识）是短期行为影响的主导驱动因素，误导率高达61%；（2）常规记忆存储行为可将短期污染以高达91%的比例转化为持久的长期记忆，跨会话行为影响率达76%；（3）在内容稀释与上下文修剪的自然浏览条件下，污染仍能跨越会话边界。总体而言，无需提示注入攻击：普通社交错误信息足以在心跳驱动的后台执行中悄无声息地塑造智能体记忆与行为。

Abstract

We identify a critical security vulnerability in mainstream Claw personal AI agents: untrusted content encountered during heartbeat-driven background execution can silently pollute agent memory and subsequently influence user-facing behavior without the user’s awareness. This vulnerability arises from an architectural design shared across the Claw ecosystem: heartbeat background execution runs in the same session as user-facing conversation, so content ingested from any external source monitored in the background (including email, message channels, news feeds, code repositories, and social platforms) can enter the same memory context used for foreground interaction, often with limited user visibility and without clear source provenance. We formalize this process as an Exposure (E) $\rightarrow$ Memory (M) $\rightarrow$ Behavior (B) pathway: misinformation encountered during heartbeat execution enters the agent’s short-term session context, potentially gets written into long-term memory, and later shapes downstream user-facing behavior. We instantiate this pathway in an agent-native social setting using MissClaw, a controlled research replica of Moltbook. We find that (1) social credibility cues, especially perceived consensus, are the dominant driver of short-term behavioral influence, with misleading rates up to 61%; (2) routine memory-saving behavior can promote short-term pollution into durable long-term memory at rates up to 91%, with cross-session behavioral influence reaching 76%; (3) under naturalistic browsing with content dilution and context pruning, pollution still crosses session boundaries. Overall, prompt injection is not required: ordinary social misinformation is sufficient to silently shape agent memory and behavior under heartbeat-driven background execution.

Keywords: AI agents, memory pollution, background execution, security vulnerability, social misinformation, heartbeat-driven execution, agent memory, behavior influence
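
The Exposure → Memory → Behavior pathway from the abstract can be made concrete with a toy model. The sketch below is purely illustrative and assumes nothing about how Claw agents are actually implemented: `ToyAgent`, its methods, and the keyword-matching "answer" logic are all hypothetical. The only property it mirrors from the abstract is that background heartbeat content and user-facing conversation share one session context, and that routine memory saving does not track provenance.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    # Short-term context shared by background and foreground activity.
    session_context: list = field(default_factory=list)
    # Long-term memory that survives across sessions.
    long_term_memory: list = field(default_factory=list)

    def heartbeat(self, external_items):
        """Exposure: background content lands in the shared session context."""
        for item in external_items:
            self.session_context.append(("background", item))

    def save_memory(self):
        """Memory: routine saving copies context text and drops its source tag."""
        for _source, text in self.session_context:
            if text not in self.long_term_memory:
                self.long_term_memory.append(text)

    def new_session(self):
        """Start a fresh session; only long-term memory carries over."""
        self.session_context = []

    def answer(self, topic):
        """Behavior: reply with the first remembered text mentioning the topic."""
        recalled = self.long_term_memory + [t for _, t in self.session_context]
        for text in recalled:
            if topic.lower() in text.lower():
                return text
        return "I don't know."


agent = ToyAgent()
agent.heartbeat(["Library X 2.0 removed feature Y"])  # unverified social post
agent.save_memory()
agent.new_session()
reply = agent.answer("library x")
```

After the session reset, the agent still repeats the unverified claim, even though the user never saw or supplied it. A provenance tag kept alongside each memory entry would be one place to intervene, though the abstract does not evaluate defenses.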

66. ❌ Minibal: Balanced Game-Playing Without Opponent Modeling

Authors: Quentin Cohen-Solal, Tristan Cazenave Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

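Scoring rationale: The paper studies balanced play in game AI and proposes the Minibal algorithm, which belongs to classical game AI and algorithm design. All scoring keywords focus on large models, deep learning, and related techniques (training methods, inference optimization, alignment, applications, and so on), none of which the paper covers, so the relevance for every keyword is 0.

!!! tip deepseek-chat TL;DR

Overly strong game-playing agents make human-AI play unbalanced. The paper proposes the Minibal algorithm and several variants; experiments on seven board games show that one variant achieves average outcomes close to perfect balance.

Abstract (Chinese translation)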

近年来，游戏人工智能领域取得了显著进展，例如AlphaZero和Athénan等智能体在多种棋盘游戏中实现了超越人类的表现。尽管这些智能体能力强大，但它们并不适合人机交互，因为它们总是以压倒性优势击败人类玩家，既无法提供游戏乐趣，也缺乏教育价值。本文致力于解决平衡博弈问题，即智能体能够在挑战对手的同时，既不形成绝对压制，也不刻意退让。
我们提出了Minibal（最小化与平衡）算法，这是专门为平衡博弈设计的Minimax算法变体。基于这一理念，我们对无界Minimax算法进行了若干改进，旨在明确探索平衡策略。
在七种棋盘游戏中进行的实验表明，其中一种改进变体能够持续实现最均衡的博弈表现，其平均结果接近完美平衡。这些结果证明，Minibal算法为设计兼具挑战性与趣味性的人工智能体奠定了良好基础，既适用于娱乐游戏，也适用于严肃游戏场景。

Abstract

Recent advances in game AI, such as AlphaZero and Athénan, have achieved superhuman performance across a wide range of board games. While highly powerful, these agents are ill-suited for human-AI interaction, as they consistently overwhelm human players, offering little enjoyment and limited educational value. This paper addresses the problem of balanced play, in which an agent challenges its opponent without either dominating or conceding. We introduce Minibal (Minimize & Balance), a variant of Minimax specifically designed for balanced play. Building on this concept, we propose several modifications of the Unbounded Minimax algorithm explicitly aimed at discovering balanced strategies. Experiments conducted across seven board games demonstrate that one variant consistently achieves the most balanced play, with average outcomes close to perfect balance. These results establish Minibal as a promising foundation for designing AI agents that are both challenging and engaging, suitable for both entertainment and serious games.

Keywords: balanced play, game AI, Minibal, Minimax, board games, human-AI interaction, Unbounded Minimax, balanced strategies
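
Balanced play can be illustrated on top of plain minimax. The sketch below is not the Minibal algorithm itself, whose backup rule the abstract does not spell out. It assumes a simpler stand-in in which the agent evaluates each move with ordinary minimax and then picks the move whose value is closest to zero (a draw-like outcome) instead of the maximum. The toy tree and function names are hypothetical.

```python
def minimax(node, maximizing):
    """Plain minimax over a nested-list game tree.

    Leaves are numeric outcomes from the agent's point of view
    (+1 agent wins, 0 draw, -1 agent loses).
    """
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def move_values(children):
    """Value of each agent move, assuming the opponent replies next."""
    return [minimax(child, maximizing=False) for child in children]


def strongest_move(children):
    """Classic choice: the move with the highest minimax value."""
    values = move_values(children)
    return max(range(len(children)), key=lambda i: values[i])


def balanced_move(children):
    """Balance-seeking choice: the move whose value is closest to 0."""
    values = move_values(children)
    return min(range(len(children)), key=lambda i: abs(values[i]))


# Three candidate moves; each inner list holds the opponent's replies.
root = [[1, 1], [0, 0.5], [-1, 0]]
print(move_values(root), strongest_move(root), balanced_move(root))
```

On this toy tree the classic agent takes the forced win (move 0), while the balance-seeking agent prefers move 1, which neither dominates nor concedes.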

Authors: Amith Nagarajan, Thomas Altman Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

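Scoring rationale: The core of DBAutoDoc is using large language models (LLMs) with iterative refinement to automatically discover and document undocumented database schemas, so it is highly relevant to "Large Language Models" (10 points). The system's iterative refinement involves a self-correction mechanism for improving descriptions, which relates to "Self-Correction" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, Quantization, and AI for Science are either not mentioned in the abstract or unrelated to the paper's topic, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes DBAutoDoc, a system that combines statistical data analysis with iterative large language model refinement to automatically discover and document undocumented relational database schemas, achieving an overall weighted score of 96.1% on benchmark databases.

Abstract (Chinese translation)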

大量关键数据库系统缺乏充分的文档记录：既未声明主键，又因性能考量舍弃了外键约束，列名使用难以理解的缩写，且不存在实体关系图。本文提出DBAutoDoc系统，该系统通过结合统计数据分析与迭代式大语言模型（LLM）精炼，实现了对无文档记录的关系型数据库模式的自动化发现与文档生成。
DBAutoDoc的核心洞见在于：模式理解本质上是一个迭代的、图结构的问题。借鉴神经网络中反向传播的结构思想，DBAutoDoc在多个精炼迭代中，通过模式依赖图传播语义修正，直至描述结果收敛。这种传播是离散且语义化的，而非数学计算，但其结构类比是精确的：早期迭代产生类似于随机初始化的粗略描述，随着上下文信息在图中流动，后续迭代逐步锐化全局图景。
本文详细阐述了该系统四项具体贡献。在一系列基准数据库测试中，DBAutoDoc采用复合评估指标，在两个模型家族（Google的Gemini与Anthropic的Claude）上取得了96.1%的整体加权得分。消融分析表明，其确定性流程相较于纯LLM外键检测带来了23个百分点的F1值提升，证实了该系统贡献显著且独立于LLM的预训练知识。DBAutoDoc已作为开源软件发布，包含全部评估配置与提示模板，确保完全可复现性。

Abstract

A tremendous number of critical database systems lack adequate documentation. Declared primary keys are absent, foreign key constraints have been dropped for performance, column names are cryptic abbreviations, and no entity-relationship diagrams exist. We present DBAutoDoc, a system that automates the discovery and documentation of undocumented relational database schemas by combining statistical data analysis with iterative large language model (LLM) refinement. DBAutoDoc’s central insight is that schema understanding is fundamentally an iterative, graph-structured problem. Drawing structural inspiration from backpropagation in neural networks, DBAutoDoc propagates semantic corrections through schema dependency graphs across multiple refinement iterations until descriptions converge. This propagation is discrete and semantic rather than mathematical, but the structural analogy is precise: early iterations produce rough descriptions akin to random initialization, and successive passes sharpen the global picture as context flows through the graph. The system makes four concrete contributions detailed in the paper. On a suite of benchmark databases, DBAutoDoc achieved overall weighted scores of 96.1% across two model families (Google’s Gemini and Anthropic’s Claude) using a composite metric. Ablation analysis demonstrates that the deterministic pipeline contributes a 23-point F1 improvement over LLM-only FK detection, confirming that the system’s contribution is substantial and independent of LLM pre-training knowledge. DBAutoDoc is released as open-source software with all evaluation configurations and prompt templates included for full reproducibility.

Keywords: database schema documentation, large language models, iterative refinement, statistical analysis, schema dependency graphs, automated discovery, relational databases, LLM refinement
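
The iterative propagation described in the abstract can be sketched as a fixed-point loop over the schema dependency graph. This is a minimal illustration under stated assumptions: `toy_refine` is a deterministic stand-in for the LLM refinement call, the three-table graph and its names are invented, and convergence is taken to mean that one full pass changes no description. None of these names come from DBAutoDoc.

```python
def toy_refine(table, current, neighbors):
    """Hypothetical stand-in for an LLM call.

    Keeps an existing description; otherwise derives one from the first
    already-described neighbor (alphabetical order keeps it deterministic).
    """
    if current != "unknown":
        return current
    described = sorted(name for name, desc in neighbors.items()
                       if desc != "unknown")
    return f"references {described[0]}" if described else "unknown"


def propagate(graph, seed, refine, max_iters=10):
    """Refine all descriptions pass by pass until nothing changes."""
    descriptions = dict(seed)
    for iteration in range(1, max_iters + 1):
        # Each pass reads only the previous pass, so update order is irrelevant.
        updated = {
            table: refine(table, descriptions[table],
                          {n: descriptions[n] for n in graph[table]})
            for table in graph
        }
        if updated == descriptions:
            return descriptions, iteration
        descriptions = updated
    return descriptions, max_iters


# Invented schema: itm -> ord -> cust, stored as an undirected adjacency map.
graph = {"cust": ["ord"], "ord": ["cust", "itm"], "itm": ["ord"]}
seed = {"cust": "customers", "ord": "unknown", "itm": "unknown"}
final, passes = propagate(graph, seed, toy_refine)
```

Context reaches `ord` in the first pass and `itm` in the second, and the third pass confirms convergence. This mirrors the abstract's picture of early rough descriptions being sharpened as context flows through the graph.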

68. ❌ MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

Authors: Zikang Huang, Meng Ge, Tianrui Wang, Xuanchen Li, Xiaobao Wang, Longbiao Wang, Jianwu Dang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

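Scoring rationale: The paper focuses on self-supervised learning for speech processing and proposes MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. The most relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points), because the paper's core contribution is a new pre-training method. "Post-training OR Supervised Fine-tuning OR SFT" receives 5 points, because the paper mentions fine-tuning as a downstream step. The remaining keywords concern large-model techniques or scientific AI applications that this speech-processing paper does not address, so they receive 0 points.

!!! tip deepseek-chat TL;DR

Existing speech self-supervised learning methods suffer from temporal resolution mismatch on mixed-sampling-rate data. The paper proposes MSR-HuBERT, a multi-sampling-rate adaptive pre-training method that outperforms the original HuBERT across 16 to 48 kHz on speech recognition and full-band speech reconstruction.

Abstract (Chinese translation)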

自监督学习（SSL）推动了语音处理领域的发展。然而，现有的语音SSL方法通常假设单一采样率，并因时间分辨率不匹配而难以处理混合采样率数据。为克服这一局限，我们提出了MSRHuBERT，一种多采样率自适应预训练方法。该方法基于HuBERT框架，将其单采样率下采样卷积神经网络（CNN）替换为多采样率自适应下采样CNN，该网络可将不同采样率的原始波形映射至共享的时间分辨率，而无需重新采样。这一设计实现了统一的混合采样率预训练与微调。在16至48千赫兹（kHz）的实验范围内，MSRHuBERT在语音识别和全频带语音重建任务上均优于HuBERT，在建模低频语义结构的同时保留了高频细节。此外，MSRHuBERT保留了HuBERT的掩码预测目标和Transformer编码器结构，因此针对HuBERT开发的现有分析与改进方法可直接应用于本模型。

Abstract

Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSRHuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSRHuBERT retains HuBERT’s mask-prediction objective and Transformer encoder, so existing analyses and improvements that were developed for HuBERT can apply directly.

Keywords: self-supervised learning, speech processing, pre-training, multi-sampling-rate, HuBERT, speech recognition, adaptive downsampling, fine-tuning
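
The temporal resolution mismatch mentioned in the abstract comes down to simple arithmetic. The sketch below uses the stride pattern of the standard HuBERT convolutional front end, (5, 2, 2, 2, 2, 2, 2), which is not stated in the abstract but gives a total stride of 320 samples, or 50 frames per second at 16 kHz. How MSR-HuBERT distributes the extra downsampling across layers is not described in the abstract, so the code only computes the total stride each sampling rate would need; the function names are illustrative.

```python
from math import prod

# Per-layer strides of the standard HuBERT convolutional front end.
HUBERT_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def frame_rate(sample_rate_hz, strides):
    """Output frames per second for a stack of strided convolutions."""
    return sample_rate_hz / prod(strides)


def required_total_stride(sample_rate_hz, target_frame_rate_hz=50):
    """Total downsampling needed to land on the shared frame rate."""
    stride, remainder = divmod(sample_rate_hz, target_frame_rate_hz)
    if remainder:
        raise ValueError("sampling rate is not a multiple of the frame rate")
    return stride


for rate in (16_000, 24_000, 32_000, 48_000):
    print(rate, frame_rate(rate, HUBERT_STRIDES), required_total_stride(rate))
```

Feeding 48 kHz audio through the unmodified front end yields 150 frames per second instead of 50, which is the mismatch in question. A total stride of 960 would bring it back to the shared resolution without resampling the waveform.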

69. ❌ Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation

Authors: Julian Oestreich, Maximilian Bley, Frank Binder, Lydia Müller, Maksym Sydorenko, André Alcalde Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23047v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

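Scoring rationale: The paper's core topic is RAG fine-tuning for electronic design automation (EDA), which is highly relevant to "Retrieval-Augmented Generation (RAG)" (15 points). It fine-tunes a 7B model, which relates to "Small Language Models (SLMs)" and "Post-training/Supervised Fine-tuning (SFT)", scored 8 and 10 points respectively. The paper is treated as an application of "AI for Science" in engineering (10 points). It examines factuality evaluation and knowledge internalization, which relates to "Hallucination Mitigation/Factuality" (8 points). Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies RAG fine-tuning of a 7B model for electronic design automation, develops the TriFEX evaluation framework and the PKP metric, finds that ROUGE and BERTScore fail to detect factual differences, and shows that the fine-tuned 7B variants outperform a 72B baseline on most metrics.

Abstract (Chinese translation)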

检索增强生成（RAG）的微调相较于原始RAG已展现出显著改进，但多数研究聚焦于文档问答任务，且常依赖可能掩盖事实性差异的标准自然语言处理指标。本文针对电子设计自动化领域的长文本生成任务评估RAG微调效果，在五种不同检索条件的上下文增强策略下对7B参数模型进行适配。我们提出了TriFEX——一种经人工验证、基于三元组的评估流程，该流程将生成主张溯源至其来源（用户查询、上下文和参考文本），并提出了参数化知识精确度（Parametric Knowledge Precision, PKP），通过过滤提示中泄露的主张来分离内化知识。实验表明，ROUGE和BERTScore未能检测出基于三元组的评估所揭示的事实性差异。此外，我们证明现有知识内化指标对检索条件敏感，其跨条件差异中约75%由内化知识表达率（PR）的变化驱动，而非实际正确性（PKP）的变化。微调后的7B模型变体在多数指标上优于72B基线模型，并进一步展现出跨条件及在相关基准测试中的泛化能力。这些结果凸显了现有RAG评估指标的局限性，同时表明较小模型能够较好地适配专业任务，实现成本高效、本地化部署。

Abstract

Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, and reference), and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models could be reasonably well adapted to specialized tasks for cost-efficient, on-premises deployment.

关键词: Retrieval-Augmented Generation, RAG fine-tuning, electronic design automation, 7B model, TriFEX evaluation, Parametric Knowledge Precision, factual evaluation, small language models
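
The abstract describes PKP as precision computed after filtering out claims that were leaked in the prompt, and PR as the rate at which internal knowledge is expressed. The paper's exact definitions are not given in the abstract, so the following is only a minimal sketch under stated assumptions: claims are (subject, relation, object) triples, a claim counts as "leaked" if it appears verbatim among the prompt triples, and `parametric_scores` and the example triples are hypothetical names and data introduced here.

```python
from typing import NamedTuple

Triple = tuple[str, str, str]  # (subject, relation, object)


class Scores(NamedTuple):
    pr: float   # share of generated claims that do not appear in the prompt
    pkp: float  # share of those non-prompt claims supported by the reference


def parametric_scores(generated: set[Triple],
                      prompt: set[Triple],
                      reference: set[Triple]) -> Scores:
    """ASSUMED definitions, not the paper's: exact-match filtering of prompt claims."""
    parametric = generated - prompt  # claims the model did not copy from the prompt
    pr = len(parametric) / len(generated) if generated else 0.0
    pkp = len(parametric & reference) / len(parametric) if parametric else 0.0
    return Scores(pr, pkp)


# Hypothetical example data for illustration only.
generated = {
    ("LVS", "compares", "layout and schematic"),
    ("DRC", "checks", "design rules"),
    ("STA", "verifies", "timing"),
    ("STA", "requires", "simulation vectors"),
}
prompt = {("DRC", "checks", "design rules")}  # leaked via query or retrieved context
reference = {
    ("LVS", "compares", "layout and schematic"),
    ("DRC", "checks", "design rules"),
    ("STA", "verifies", "timing"),
}

scores = parametric_scores(generated, prompt, reference)
```

Separating the two numbers mirrors the abstract's point that a metric can move because the model expresses more non-prompt claims (PR) without those claims becoming more correct (PKP).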

70. ❌ HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling

Authors: António Cardoso, Pedro Sousa, Tania Pereira, Hélder P. Oliveira Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
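
The report header gives the passing threshold as 5.0 + 0.8 × total weight and caps each keyword at 15 points, but it does not state how a row's score follows from its weight and relevance. The sketch below is one plausible reading, not the report generator's actual code: `keyword_score` and its formula (weight × relevance/10 × 15) are assumptions introduced here for illustration.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report header


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the header: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED formula: scale 0-10 relevance to 0-15 points, then apply the weight."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


threshold = passing_score(27.0)              # 27 keywords with weight 1.0 each
zero_weight_row = keyword_score(0.0, 10.0)   # a row like the last one in the table
full_weight_row = keyword_score(1.0, 10.0)
```

Under this assumption, a row with weight 0.0 contributes 0.0 even at 10.0/10 relevance, which is consistent with the rows in the table above.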

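Scoring rationale: The paper focuses on medical image generation, specifically lung CT synthesis, which is an application of AI in the biomedical domain. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is directly relevant, because the paper applies AI to biomedicine (medical imaging), which falls within AI for Science. The other keywords concern large-model technical principles, training methods, inference optimization, agent systems, and so on, and are unrelated; the paper does not involve large models, innovations in deep-learning technical principles, or related technical discussion.

!!! tip deepseek-chat TL;DR

The paper proposes a new method for full-range lung CT synthesis based on generative modelling over multiple HU intervals; its decomposition strategy and reconstruction network significantly improve synthesized image quality while reducing computational cost.

Abstract Translation (Chinese)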

当前，在医学影像领域部署和验证计算机辅助诊断（CAD）模型的一个核心挑战与瓶颈是数据稀缺性。对于肺癌这一全球最常见的癌症类型之一，有限的数据集可能延误诊断并影响患者预后。生成式人工智能为此问题提供了有前景的解决方案，但处理全亨氏单位（HU）范围肺部CT扫描的复杂分布具有挑战性，且仍是一项计算需求极高的任务。本文提出了一种新颖的分解策略，该策略一次仅合成一个HU区间的CT图像，而非一次性建模整个HU域。该框架专注于在针对特定组织的单个HU窗口上训练生成架构，随后通过一个学习的重建网络将其输出合并为全范围扫描，该网络能有效逆转HU窗位调整过程。我们进一步提出了多头和多解码器模型，以在保持解剖结构一致性的同时更好地捕捉纹理特征，其中多头VQVAE在生成任务中取得了最佳性能。定量评估表明，该方法显著优于传统的二维全范围基线模型，在所有HU区间内实现了FID指标6.2%的提升，并在MMD、精确度和召回率上表现更优。最佳性能由多头VQVAE变体实现，证明在降低模型复杂度和计算成本的同时，提升视觉保真度与多样性是可行的。此项工作为结构感知的医学图像合成建立了新范式，使生成建模与临床解读相统一。

Abstract

Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and have an impact on patient outcome. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains as a highly computationally demanding task. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.

Keywords: lung CT synthesis, generative AI, Hounsfield Unit intervals, multi-head VQVAE, medical image synthesis, computational cost reduction, structure-aware synthesis, clinical interpretation
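
The decomposition described in the abstract rests on HU windowing: each generator sees only one tissue-focused interval, and a reconstruction step maps the per-interval outputs back to the full HU range. The sketch below illustrates windowing and a naive analytic inverse on a few voxel values. It is not the paper's method: the interval names and bounds in `INTERVALS` are hypothetical, and `naive_merge` is a simple stand-in for the learned reconstruction network.

```python
# Hypothetical contiguous HU intervals; the paper's actual windows are not given here.
INTERVALS = {
    "lung": (-1000.0, -400.0),
    "soft_tissue": (-400.0, 200.0),
    "bone": (200.0, 1000.0),
}


def hu_window(values, lo, hi):
    """Clip HU values to [lo, hi] and rescale them to [0, 1]."""
    span = hi - lo
    return [(min(max(v, lo), hi) - lo) / span for v in values]


def decompose(values):
    """Produce one windowed channel per HU interval."""
    return {name: hu_window(values, lo, hi) for name, (lo, hi) in INTERVALS.items()}


def naive_merge(channels):
    """Invert windowing for contiguous intervals by summing each channel's HU span.

    Exact only for noiseless, mutually consistent channels; generated channels
    would not satisfy this, so this is a stand-in, not a learned reconstruction.
    """
    floor = min(lo for lo, _ in INTERVALS.values())
    length = len(next(iter(channels.values())))
    merged = [floor] * length
    for name, (lo, hi) in INTERVALS.items():
        for i, x in enumerate(channels[name]):
            merged[i] += x * (hi - lo)
    return merged


voxels = [-700.0, 50.0, 600.0]   # air-like, soft-tissue-like, bone-like values
channels = decompose(voxels)
restored = naive_merge(channels)
```

Each channel saturates outside its own interval, so a single channel carries no detail about other tissues; only the combination recovers the full-range values.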

71. ❌ Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts

Authors: Maria Conchita Agana Navarro, Geng Li, Theo Wolf, Maria Perez-Ortiz Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23043v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

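Scoring rationale: The paper evaluates the robustness of climate foundation models (the ClimaX foundation model) under no-analog distribution shifts, which is application research on large models in a scientific domain. It explicitly mentions "foundation model", so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It is also an application of "AI for Science" in climate science, so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The other keywords mainly concern large-model technical principles (such as MoE, RLHF, PEFT), reasoning methods (such as CoT, agents), or specific application areas (such as bioinformatics); the paper does not touch on these techniques or areas, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper evaluates the robustness of climate foundation models under future climate scenarios beyond the historical training data (no-analog distribution shifts). It finds that even high-performing foundation models, when trained only on historical data, remain sensitive to changes in external forcing, revealing a trade-off between accuracy and stability.

Abstract Translation (Chinese)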

气候变化的加速进程引入了显著的非平稳性，这对基于机器学习的气候模拟器在训练分布之外的泛化能力构成了挑战。尽管这些模拟器为传统地球系统模型提供了计算高效的替代方案，但在“无相似态”的未来气候状态下——此处我们将其定义为外部强迫将系统驱动至超出历史训练数据经验范围的状态——其可靠性仍可能成为潜在瓶颈。评估这一可靠性的一个根本性挑战是数据污染；由于许多模型是在已包含未来情景的模拟数据上训练的，其真实的分布外性能往往被掩盖。为解决此问题，我们以三种先进架构——U-Net、ConvLSTM以及专门限定在纯历史训练区间（1850-2014）的ClimaX基础模型——为基准，系统评估了它们的分布外鲁棒性。我们采用两种互补策略进行评估：（i）对近期气候（2015-2023）进行时间外推；（ii）在不同排放路径间进行跨情景强迫迁移。在此实验框架下的分析揭示了一种精度与稳定性之间的权衡：虽然ClimaX基础模型实现了最低的绝对误差，但在分布变化下表现出更高的相对性能波动，在极端强迫情景下降水误差增幅最高达8.44%。这些发现表明，当局限于历史训练动态时，即使是大规模基础模型也会对外部强迫轨迹表现出敏感性。我们的研究结果强调，必须采用情景感知的训练方法和严格的分布外评估方案，以确保气候模拟器在变化气候下的鲁棒性。

Abstract

The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under “no-analog” future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.

Keywords: climate foundation models, no-analog distribution shifts, out-of-distribution robustness, climate emulators, historical training regime, accuracy vs. stability trade-off, scenario-aware training
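
The accuracy vs. stability trade-off in the abstract contrasts absolute error with relative change in error under distribution shift. A small sketch of that comparison follows; the model names and error values are hypothetical placeholders chosen only to show how a model can rank first on one measure and last on the other, and they are not results from the paper.

```python
def relative_change_pct(in_dist_error: float, ood_error: float) -> float:
    """Percentage increase of the out-of-distribution error over the in-distribution error."""
    return (ood_error - in_dist_error) / in_dist_error * 100.0


# Hypothetical (in-distribution error, OOD error) pairs for illustration only.
errors = {
    "model_a": (0.50, 0.54),  # lower absolute error, larger relative degradation
    "model_b": (0.80, 0.82),  # higher absolute error, smaller relative degradation
}

most_accurate = min(errors, key=lambda name: errors[name][1])
most_stable = min(errors, key=lambda name: relative_change_pct(*errors[name]))
```

Reporting both quantities side by side makes the trade-off visible; a single ranking by absolute error would hide the stability difference.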

72. ❌ YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

Authors: Marios Impraimakis, Daniel Vazquez, Feiyu Zhou Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

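Scoring rationale: The paper mainly studies interpretable object detection and trustworthy AI in computer vision. It uses a Kolmogorov-Arnold network as an interpretable surrogate model to assess the trustworthiness of YOLOv10 detections, and integrates the BLIP vision-language foundation model to generate scene descriptions. It is unrelated to most of the large-model keywords (such as MoE, RLHF, RAG), because those target language-model architectures, training methods, or inference techniques, whereas this paper focuses on vision tasks. It is highly relevant only to "Mechanistic Interpretability OR Explainable AI" (10 points), because the core contribution is an interpretable trust-assessment framework; it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it addresses the trustworthiness of AI in a scientific application (autonomous-driving perception); and it has weak relevance to "Large Language Models OR LLMs OR Foundation Models" (5 points), because BLIP is used as a vision-language foundation model but is not central. No other keywords are covered.

!!! tip deepseek-chat TL;DR

To address unreliable object-detection confidence in computer vision systems such as autonomous driving, the paper proposes an interpretable trust-assessment framework based on a Kolmogorov-Arnold network and combines it with the BLIP vision-language model to build a transparent and trustworthy multimodal object detection system.

Abstract Translation (Chinese)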

本文研究了一种新型柯尔莫哥洛夫-阿诺德网络框架的可解释目标检测能力。该方法针对自动驾驶车辆感知及其他计算机视觉领域的一个关键局限：在视觉质量退化或场景模糊的情况下，现有系统对其置信度评分的可靠性缺乏透明度。为突破这一限制，本研究采用柯尔莫哥洛夫-阿诺德网络作为可解释的事后代理模型，利用七项几何与语义特征来建模“你只看一次”算法（Yolov10）检测结果的可信度。该网络基于可加性样条的结构实现了对各特征影响的直接可视化，从而生成平滑透明的函数映射，清晰揭示模型置信度何时具有可靠依据、何时存在不确定性。在通用对象上下文数据集（COCO）及巴斯大学校园图像上的实验表明，该框架能准确识别在模糊、遮挡或低纹理场景下的低可信度预测，为结果筛选、人工复核及下游风险缓解提供了可操作的依据。此外，本研究通过自举语言-图像预训练基础模型（BLIP）生成各场景的描述性文本，构建了一个不影响可解释层的轻量级多模态交互界面。最终系统实现了具有可信置信度评估的可解释目标检测，为自动驾驶及多模态人工智能应用提供了透明实用的感知工具。

Abstract

The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation in computer vision for autonomous vehicle perception and beyond. These systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature’s influence. This produces smooth and transparent functional mappings that reveal when the model’s confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO) and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates. It offers a transparent and practical perception component for autonomous and multimodal artificial intelligence applications.

Keywords: interpretable object detection, Kolmogorov-Arnold network, trustworthy AI, YOLOv10, vision-language foundation models, autonomous vehicles perception, confidence estimation, multimodal AI
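
The abstract attributes the surrogate's interpretability to its additive, spline-based structure: the trust estimate is built from one univariate function per feature, so each feature's contribution can be read off directly. The sketch below shows that structure with degree-1 (piecewise-linear) splines in place of a trained Kolmogorov-Arnold network. The feature names, knot positions, and knot values in `SHAPES` are hypothetical and hand-picked, not learned, and only three of the seven features are mocked up.

```python
import math
from bisect import bisect_right


def piecewise_linear(knots_x, knots_y, x):
    """Evaluate a degree-1 spline; values are clamped outside the knot range."""
    if x <= knots_x[0]:
        return knots_y[0]
    if x >= knots_x[-1]:
        return knots_y[-1]
    i = bisect_right(knots_x, x) - 1          # segment with knots_x[i] <= x < knots_x[i+1]
    t = (x - knots_x[i]) / (knots_x[i + 1] - knots_x[i])
    return knots_y[i] + t * (knots_y[i + 1] - knots_y[i])


# Hypothetical per-feature shape functions (knot x-positions, knot y-values).
SHAPES = {
    "confidence": ([0.0, 0.5, 1.0], [-2.0, 0.0, 2.0]),
    "blur": ([0.0, 1.0], [1.0, -3.0]),
    "box_area": ([0.0, 0.1, 1.0], [-1.0, 0.5, 0.5]),
}


def trust(features):
    """Return (trust score in (0, 1), per-feature additive contributions)."""
    contributions = {
        name: piecewise_linear(xs, ys, features[name]) for name, (xs, ys) in SHAPES.items()
    }
    logit = sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-logit)), contributions


sharp_score, sharp_parts = trust({"confidence": 0.9, "blur": 0.0, "box_area": 0.3})
blurry_score, blurry_parts = trust({"confidence": 0.9, "blur": 0.9, "box_area": 0.3})
```

Because the model is additive, the drop in trust between the two detections is fully explained by the change in the `blur` contribution, which is the kind of per-feature readout the abstract refers to.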

73. ❌ Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

Authors: ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

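Scoring rationale: The paper focuses on open-vocabulary semantic segmentation in computer vision and proposes a method for improving CLIP inference (GLA-CLIP) that uses global-local alignment and dynamic normalization to resolve semantic inconsistency in sliding-window inference. Its core is inference optimization for a vision model (CLIP); it does not involve large language models (LLMs) or innovations in deep-learning technical principles (such as MoE, scaling laws, training methods, alignment, inference optimization, agents), nor does it belong to scientific AI application areas such as bioinformatics. All scoring keywords target large language models and related techniques and are unrelated to this paper's visual segmentation topic, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the semantic inconsistency caused by sliding-window inference in training-free open-vocabulary semantic segmentation. It proposes the GLA-CLIP framework, which strengthens cross-window information exchange through global-local alignment and dynamic normalization and thereby improves segmentation performance.

Abstract Translation (Chinese)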

近期无需训练的开集词汇语义分割方法常采用滑动窗口推理策略，以克服CLIP模型处理高分辨率图像的局限性。然而，该方法引入了新的挑战：每个窗口被独立处理，导致窗口间出现语义不一致。为解决此问题，我们提出全局-局部对齐CLIP框架（Global-Local Aligned CLIP, GLA-CLIP），该框架促进了跨窗口的全面信息交换。GLA-CLIP不将注意力局限于单个窗口内的特征标记（tokens），而是扩展键值标记以整合来自所有窗口的上下文信息。但我们观察到窗口偏差现象：由于查询特征是通过内部窗口图像块（patches）的交互产生的，缺乏局部上下文之外的语义关联，导致外部窗口标记被关注的可能性降低。为缓解此问题，我们引入代理锚点（proxy anchor），通过聚合所有窗口中与给定查询高度相似的标记构建，为衡量内部与外部窗口图像块的相似性提供了统一的语义参考。此外，我们提出动态归一化方案，通过动态缩放和阈值化注意力图来根据目标尺度调整注意力强度，以应对小目标场景。GLA-CLIP可集成于现有方法中，拓宽其感受野。大量实验验证了GLA-CLIP在提升无需训练的开集词汇语义分割性能方面的有效性。代码发布于https://github.com/2btlFe/GLA-CLIP。

Abstract

A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be integrated into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.

Keywords: Open-vocabulary semantic segmentation, CLIP, Sliding-window inference, Global-local alignment, Training-free methods, Attention mechanism, Dynamic normalization, Computer vision
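
The core contrast in the abstract is between attention restricted to one window's tokens and attention whose key-value set is extended to tokens from all windows. The toy sketch below shows only that contrast, using plain dot-product softmax attention on two-dimensional tokens. It does not implement the proxy anchor or the dynamic normalization scheme, and the token values and the `attend` helper are hypothetical, not taken from the GLA-CLIP code.

```python
import math


def attend(query, keys, values):
    """Softmax dot-product attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                                # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Two windows with two hypothetical patch tokens each (tokens serve as keys and values).
windows = [
    [(1.0, 0.0), (0.9, 0.1)],   # window 0
    [(0.0, 1.0), (1.0, 0.1)],   # window 1
]
query = windows[0][0]

local_tokens = windows[0]                                   # per-window attention
global_tokens = [tok for win in windows for tok in win]     # keys/values from all windows

local_out = attend(query, local_tokens, local_tokens)
global_out = attend(query, global_tokens, global_tokens)
```

With the extended key-value set, the output for the same query mixes in information from window 1, which per-window attention cannot see.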

74. ❌ Concept-based explanations of Segmentation and Detection models in Natural Disaster Management

Authors: Samar Heydari, Jawher Said, Galip Ümit Yolcu, Evgenii Kortukov, Elena Golimblevskaia, Evgenios Vlachos, Vasileios Mygdalis, Ioannis Pitas, Sebastian Lapuschkin, Leila Arras Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

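Scoring rationale: The paper focuses on applying deep learning models to natural disaster management (flood and wildfire segmentation, car detection) and proposes an explainability framework that includes an extension of LRP and prototypical concept-based explanations. It is unrelated to most keywords, because they mainly concern large language model (LLM) techniques, training methods, inference optimization, and so on, whereas this paper studies explainability for computer vision tasks (segmentation and detection). It is highly relevant only to "Mechanistic Interpretability OR Explainable AI" (10 points), because the core contribution is an explainable-AI method; it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), because the application area, natural disaster management, falls within AI for Science, though it is not bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

To address the lack of transparency of deep-learning segmentation and detection models in natural disaster management, the paper proposes an explainability framework that extends LRP and applies prototypical concept-based explanations to provide reliable, interpretable analysis of predictions while retaining near real-time inference, making it suitable for resource-constrained platforms such as drones.

Abstract Translation (Chinese)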

用于洪水和野火分割与目标检测的深度学习模型在嵌入式无人机平台上部署时，能够实现精确、实时的灾害定位。然而，在自然灾害管理中，其决策过程缺乏透明度，阻碍了应急响应所需的人类信任。为解决这一问题，我们提出了一种可解释性框架，用于理解在广泛使用的PIDNet和YOLO架构上进行的洪水分割和车辆检测预测。具体而言，我们引入了一种新颖的再分配策略，该策略扩展了逐层相关性传播（Layer-wise Relevance Propagation, LRP）解释方法，使其适用于Sigmoid门控的逐元素融合层。这一扩展使得LRP相关性能够流经PIDNet的融合模块，覆盖整个计算图直至输入图像。此外，我们应用基于原型概念的解释（Prototypical Concept-based Explanations, PCX），在概念层面提供局部和全局解释，揭示哪些学习到的特征驱动了特定灾害语义类别的分割与检测。在公开可用的洪水数据集上的实验表明，我们的框架提供了可靠且可解释的解释，同时保持了近实时的推理能力，使其适合部署在资源受限的平台（如无人机，Unmanned Aerial Vehicles, UAVs）上。

Abstract

Deep learning models for flood and wildfire segmentation and object detection enable precise, real-time disaster localization when deployed on embedded drone platforms. However, in natural disaster management, the lack of transparency in their decision-making process hinders human trust required for emergency response. To address this, we present an explainability framework for understanding flood segmentation and car detection predictions on the widely used PIDNet and YOLO architectures. More specifically, we introduce a novel redistribution strategy that extends Layer-wise Relevance Propagation (LRP) explanations for sigmoid-gated element-wise fusion layers. This extension allows LRP relevances to flow through the fusion modules of PIDNet, covering the entire computation graph back to the input image. Furthermore, we apply Prototypical Concept-based Explanations (PCX) to provide both local and global explanations at the concept level, revealing which learned features drive the segmentation and detection of specific disaster semantic classes. Experiments on a publicly available flood dataset show that our framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities, rendering it suitable for deployment on resource-constrained platforms, such as Unmanned Aerial Vehicles (UAVs).

Keywords: Explainable AI, Natural Disaster Management, Segmentation, Object Detection, Layer-wise Relevance Propagation, Prototypical Concept Explanations, Real-time Inference, UAV Deployment
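
For readers unfamiliar with Layer-wise Relevance Propagation, the sketch below shows the standard LRP-epsilon rule for a single bias-free linear layer, where relevance at the outputs is redistributed to the inputs in proportion to each input's contribution. This is the generic textbook rule only; the paper's new redistribution strategy for sigmoid-gated element-wise fusion layers is not described in the abstract and is not reproduced here. The function name and the example numbers are hypothetical.

```python
def lrp_epsilon(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance to inputs with the LRP-epsilon rule.

    weights[i][j] connects input i to output j; the layer has no bias term.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_j = sum_i a_i * w_ij.
    z = [sum(activations[i] * weights[i][j] for i in range(n_in)) for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The stabiliser keeps the denominator away from zero and preserves its sign.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


activations = [1.0, 2.0]
weights = [[0.5, -1.0],
           [0.25, 1.0]]
relevance_out = [1.0, 0.5]

relevance_in = lrp_epsilon(activations, weights, relevance_out)
```

Up to the small stabiliser term, the total relevance is conserved from outputs to inputs, which is the property that lets relevance be traced layer by layer back to the input image.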

75. ❌ A Sobering Look at Tabular Data Generation via Probabilistic Circuits

Authors: Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

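Scoring rationale: The paper focuses on tabular data generation. It (1) criticizes the limitations of current evaluation protocols for tabular data generation; (2) proposes deep probabilistic circuits (PCs) as a simple baseline; and (3) shows through empirical analysis that the apparent saturation of progress among current SotA models stems from inadequate evaluation metrics. The paper does not involve large language models (LLMs), innovations in deep-learning technical principles, or applications of AI in science, so none of the keywords relate to its topic.

!!! tip deepseek-chat TL;DR

The paper questions the perceived progress in tabular data generation, points out the limitations of current evaluation protocols, and shows that deep probabilistic circuits (PCs), as a simple baseline, match or exceed SotA models at lower cost, while stressing that there is still considerable room for improvement in generating realistic tabular data.

Abstract Translation (Chinese)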

表格数据的生成比文本和图像更具挑战性，这源于其特征的异构性以及远低于其他模态的样本量。在此任务中，基于扩散的模型是当前的最先进模型类别，在常用基准测试中实现了近乎完美的性能。本文对表格数据生成领域的进展认知提出了质疑。首先，我们指出了当前评估生成数据保真度的协议存在的局限性，并倡导采用替代方案。接着，我们重新审视了一个简单的基线模型——采用深度概率电路形式的分层混合模型——该模型以极低的成本实现了与最先进模型相当或更优的性能。概率电路是决策森林的生成式对应模型，因此能原生处理异构数据，并提供可处理的概率生成与推断能力。最后，通过严格的实证分析，我们表明最先进模型所表现出的进展饱和现象很大程度上源于使用了不充分的评估指标。因此，我们强调要生成逼真的表格数据，仍有大量工作亟待完成。代码发布于 https://github.com/april-tools/tabpc。

Abstract

Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline – hierarchical mixture models in the form of deep probabilistic circuits (PCs) – which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april-tools/tabpc.

Keywords: tabular data generation, probabilistic circuits, evaluation protocols, diffusion models, hierarchical mixture models, fidelity metrics, generative models, benchmark limitations
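
The abstract describes the baseline as hierarchical mixture models in the form of probabilistic circuits that handle heterogeneous columns and support tractable generation and inference. The sketch below is a deliberately shallow instance of that idea: one sum node (a mixture) over product nodes, each combining a Gaussian leaf for a numeric column with a categorical leaf for a discrete column. It is not the TabPC implementation; the columns `age` and `plan` and all parameters are hypothetical and hand-set rather than learned.

```python
import math
import random

# (mixture weight, Gaussian leaf for "age", categorical leaf for "plan")
COMPONENTS = [
    (0.6, {"mean": 30.0, "std": 5.0}, {"basic": 0.8, "pro": 0.2}),
    (0.4, {"mean": 55.0, "std": 8.0}, {"basic": 0.3, "pro": 0.7}),
]


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def density(age=None, plan=None):
    """Exact joint or marginal density; a column left as None is marginalised out."""
    total = 0.0
    for weight, gauss, cat in COMPONENTS:
        term = weight
        if age is not None:
            term *= gaussian_pdf(age, gauss["mean"], gauss["std"])
        if plan is not None:
            term *= cat[plan]
        total += term
    return total


def sample(rng):
    """Draw one synthetic row: pick a mixture component, then sample each leaf."""
    _, gauss, cat = rng.choices(COMPONENTS, weights=[w for w, _, _ in COMPONENTS])[0]
    age = rng.gauss(gauss["mean"], gauss["std"])
    plan = rng.choices(list(cat), weights=list(cat.values()))[0]
    return {"age": age, "plan": plan}


rng = random.Random(0)
rows = [sample(rng) for _ in range(5)]
p_basic = density(plan="basic")               # marginal over age, computed exactly
p_pro_given_age = density(age=60.0, plan="pro") / density(age=60.0)
```

Marginals and conditionals come from the same single pass over the structure, which is what "tractable" refers to here; a deeper circuit would nest further sum and product nodes in place of the leaves.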

76. ❌ Can Large Language Models Reason and Optimize Under Constraints?

Authors: Fabien Bernier, Salah Ghamizi, Pantelis Dogoulis, Maxime Cordy Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

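Scoring rationale: The paper's core is the reasoning ability of LLMs on a constrained optimization problem (Optimal Power Flow). It is highly relevant to "Large Language Models" (10 points) and touches on "Chain of Thought" and "System 2 Thinking" (8 points each), because it evaluates LLMs' multi-step and in-depth reasoning. It has some relevance to "AI for Science" (5 points), because of the application to power-grid optimization, a scientific domain. The other keywords (such as MoE, SFT, RAG) are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the reasoning and optimization abilities of large language models on the constrained Optimal Power Flow problem in power systems. It finds that current state-of-the-art LLMs fail on most tasks, revealing critical gaps in LLMs' structured reasoning under constraints.

Abstract Translation (Chinese)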

大型语言模型（LLMs）已在多种自然语言任务中展现出卓越能力；然而，其在约束条件下解决抽象与优化问题的能力仍鲜有研究。本文探讨了LLMs能否在最优潮流（Optimal Power Flow, OPF）问题的物理与运行约束下进行推理和优化。我们引入了一个具有挑战性的评估框架，该框架要求模型具备推理、结构化输入处理、算术运算及约束优化等一系列基本技能。评估结果表明，当前最先进的LLMs在多数任务中均告失败，而具备推理能力的LLMs在最复杂的情境下依然无法成功。我们的发现揭示了LLMs在约束条件下进行结构化推理能力方面存在关键不足，本研究为开发能够应对现实世界电网优化问题的更强LLM助手提供了一个严谨的测试环境。

摘要 (Abstract)

Large Language Models (LLMs) have demonstrated great capabilities across diverse natural language tasks; yet their ability to solve abstraction and optimization problems with constraints remains scarcely explored. In this paper, we investigate whether LLMs can reason and optimize under the physical and operational constraints of Optimal Power Flow (OPF) problem. We introduce a challenging evaluation setup that requires a set of fundamental skills such as reasoning, structured input handling, arithmetic, and constrained optimization. Our evaluation reveals that SoTA LLMs fail in most of the tasks, and that reasoning LLMs still fail in the most complex settings. Our findings highlight critical gaps in LLMs’ ability to handle structured reasoning under constraints, and this work provides a rigorous testing environment for developing more capable LLM assistants that can tackle real-world power grid optimization problems.

Keywords: Large Language Models, reasoning, optimization, constraints, Optimal Power Flow, evaluation, structured input, power grid

77. ❌ AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

Authors: Yutao Luo, Haotian Zhu, Shuchao Pang, Zhigang Lu, Tian Dong, Yongbin Zhou, Minhui Xue Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

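Scoring rationale: The paper studies backdoor attacks on mobile GUI agents and is unrelated to most of the large-model keywords. The only highly relevant keyword is "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), because attacking mobile GUI agents is the core of the paper. "Post-training OR Supervised Fine-tuning OR SFT" receives 5 points because the paper uses a "backdoor post-training" step, although this is not its main technical focus. The other keywords, such as large models, MoE, quantization, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes AgentRAE, a new backdoor attack that uses visually natural triggers (such as benign app icons in notifications) to induce mobile GUI agents to execute remote actions. It reports an attack success rate above 90% and evades existing defenses.

Abstract translation

The rapid adoption of mobile graphical user interface (GUI) agents, which can autonomously control applications and operating systems (OS), exposes new system-level attack surfaces. Existing backdoor attacks against web GUI agents and general generative AI (GenAI) models rely on environmental injection or deceptive pop-ups to mislead agent operations. However, these techniques do not work on screenshot-based mobile GUI agents because of challenges such as a restricted trigger design space, OS background interference, and conflicts among multiple trigger-action mappings. We propose AgentRAE, a novel backdoor attack that uses visually natural triggers (for example, benign app icons in notifications) to induce mobile GUI agents to perform remote action execution. To address the underfitting caused by natural triggers and to achieve precise multi-target action redirection, we design a novel two-stage pipeline: contrastive learning first increases the agent's sensitivity to subtle icon differences, and backdoor post-training then associates each trigger with a specific mobile GUI agent action. Our extensive evaluation shows that the proposed backdoor preserves normal performance while achieving an attack success rate above 90% across ten mobile operations. In addition, its benign-looking triggers are hard to notice with the naked eye and evade eight representative state-of-the-art defenses. These results reveal an overlooked backdoor attack vector in mobile GUI agents and underscore the need for defenses that rigorously scrutinize notification-conditioned behaviors and internal agent representations.

Abstract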

The rapid adoption of mobile graphical user interface (GUI) agents, which autonomously control applications and operating systems (OS), exposes new system-level attack surfaces. Existing backdoors against web GUI agents and general GenAI models rely on environmental injection or deceptive pop-ups to mislead the agent operation. However, these techniques do not work on screenshots-based mobile GUI agents due to the challenges of restricted trigger design spaces, OS background interference, and conflicts in multiple trigger-action mappings. We propose AgentRAE, a novel backdoor attack capable of inducing Remote Action Execution in mobile GUI agents using visually natural triggers (e.g., benign app icons in notifications). To address the underfitting caused by natural triggers and achieve accurate multi-target action redirection, we design a novel two-stage pipeline that first enhances the agent’s sensitivity to subtle iconographic differences via contrastive learning, and then associates each trigger with a specific mobile GUI agent action through a backdoor post-training. Our extensive evaluation reveals that the proposed backdoor preserves clean performance with an attack success rate of over 90% across ten mobile operations. Furthermore, it is hard to visibly detect the benign-looking triggers and circumvents eight representative state-of-the-art defenses. These results expose an overlooked backdoor vector in mobile GUI agents, underscoring the need for defenses that scrutinize notification-conditioned behaviors and internal agent representations.

Keywords: mobile GUI agents, backdoor attack, remote action execution, notification-based triggers, contrastive learning, post-training, attack success rate, defense evasion

78. ❌ On the use of Aggregation Operators to improve Human Identification using Dental Records

Authors: Antonio D. Villegas-Yeguas, Guillermo R-García, Tzipi Kahana, Jorge Pinares Toledo, Esi Sharon, Oscar Ibañez, Oscar Cordón Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

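Scoring rationale: The paper studies aggregation methods for the automatic comparison of dental records, an application of AI to forensic science and bioinformatics, so it is moderately relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It emphasizes the explainability of the methods, which relates to "Mechanistic Interpretability OR Explainable AI" (5 points). However, it does not cover large models, the principles of deep learning techniques, LLM-related methods (such as MoE, fine-tuning, or inference optimization), agents, or world models, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

For the problem of comparing forensic dental records, this paper proposes several explainable aggregation methods (data-driven ordering, fuzzy logic, and machine learning). Validation on 215 cases shows that white-box machine learning methods improve ranking performance (average rank from 3.91 to between 2.02 and 2.21, where lower is better) while remaining explainable.

Abstract translation

Dental record comparison is a standardized technique in forensic odontology used to speed up the identification of individuals in multiple-comparison scenarios. Specifically, odontogram comparison is a procedure that computes several criteria used to produce a ranking. Current state-of-the-art automatic methods either use simple techniques that do not exploit the full potential of the information obtained from a comparison, or their internal behavior is unknown because of the lack of peer-reviewed publications. This work aims to design aggregation mechanisms that experts can understand and validate, in order to compare pairs of dental records automatically and improve on current methods. To this end, we introduce several aggregation approaches that use the state-of-the-art codification and are based on seven different criteria. In particular, we study the performance of three kinds of aggregation mechanism: i) data-driven lexicographical order-based aggregations, ii) well-established fuzzy logic aggregation methods, and iii) machine learning techniques used as aggregation mechanisms. To validate the proposals, we use 215 forensic cases from two different populations. The results show that white-box machine learning techniques used as aggregation models (average ranking between 2.02 and 2.21) improve on the state of the art (average ranking of 3.91) while preserving the explainability and interpretability of the method.

Abstract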

The comparison of dental records is a standardized technique in forensic dentistry used to speed up the identification of individuals in multiple-comparison scenarios. Specifically, the odontogram comparison is a procedure to compute criteria that will be used to perform a ranking. State-of-the-art automatic methods either make use of simple techniques, without utilizing the full potential of the information obtained from a comparison, or their internal behavior is not known due to the lack of peer-reviewed publications. This work aims to design aggregation mechanisms to automatically compare pairs of dental records that can be understood and validated by experts, improving the current methods. To do so, we introduce different aggregation approaches using the state-of-the-art codification, based on seven different criteria. In particular, we study the performance of i) data-driven lexicographical order-based aggregations, ii) well-known fuzzy logic aggregation methods and iii) machine learning techniques as aggregation mechanisms. To validate our proposals, 215 forensic cases from two different populations have been used. The results obtained show how the use of white-box machine learning techniques as aggregation models (average ranking from 2.02 to 2.21) are able to improve the state-of-the-art (average ranking of 3.91) without compromising the explainability and interpretability of the method.

Keywords: dental records, forensic dentistry, aggregation operators, white-box machine learning, explainability, odontogram comparison, human identification, ranking improvement
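
The lexicographic aggregation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the criterion ordering, the example scores, and the helper names `lexicographic_rank` and `average_rank` are hypothetical, and the sketch assumes that a higher criterion value indicates a better match. Under those assumptions, candidates are ordered by the first criterion, ties are broken by the next one, and the average rank of the true match (lower is better) summarizes ranking quality.

```python
from typing import Dict, List, Sequence


def lexicographic_rank(candidates: Dict[str, Sequence[float]],
                       priority: Sequence[int]) -> List[str]:
    """Order candidate IDs by comparing criteria in the given priority order.

    Assumption: higher criterion values mean a better match, so values are
    negated to sort best-first.
    """
    def sort_key(name: str):
        scores = candidates[name]
        return tuple(-scores[i] for i in priority)

    return sorted(candidates, key=sort_key)


def average_rank(rankings: Sequence[Sequence[str]],
                 true_ids: Sequence[str]) -> float:
    """Mean 1-based position of the true match across cases (lower is better)."""
    positions = [list(r).index(t) + 1 for r, t in zip(rankings, true_ids)]
    return sum(positions) / len(positions)


# Hypothetical case with two criteria per candidate (the paper uses seven).
case = {"A": (3, 1), "B": (3, 2), "C": (1, 9)}
by_first = lexicographic_rank(case, priority=[0, 1])   # B beats A on the tie-breaker
by_second = lexicographic_rank(case, priority=[1, 0])  # C leads on criterion 1
print(by_first, by_second)
print(average_rank([by_first, by_second], ["A", "C"]))
```

The priority order is the only tunable part of this scheme, which is why a data-driven choice of that order (as the abstract describes) stays easy for an expert to inspect.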

79. ❌ JFTA-Bench: Evaluate LLM’s Ability of Tracking and Analyzing Malfunctions Using Fault Trees

Authors: Yuhui Wang, Zhixiong Yang, Ming Zhang, Shihan Dou, Zhiheng Xi, Enyu Zhou, Senjie Jin, Yujiong Shen, Dingwei Zhu, Yi Dong, Tao Gui, Qi Zhang, Xuanjing Huang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22978v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

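Scoring rationale: The paper builds a benchmark (JFTA-Bench) that evaluates how well LLMs track and analyze malfunctions in fault tree analysis. It converts fault tree images into a textual representation that LLMs can process and evaluates task tracking and error recovery in multi-turn dialogue. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since LLMs are the paper's core technology. It is moderately relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because fault tree analysis can be viewed as a scientific application in complex system maintenance, although the paper does not explicitly involve bioinformatics or cheminformatics. Other keywords such as MoE, SFT, RAG, and CoT are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

This paper proposes JFTA-Bench, a benchmark for evaluating how well large language models locate and analyze malfunctions using fault trees in complex system maintenance. It converts fault tree images into a textual representation and builds a multi-turn dialogue dataset to test task tracking and error recovery; Gemini 2.5 pro performs best.

Abstract translation

In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To allow fault trees stored as images to be processed directly by large language models, and thereby assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments and evaluates a model's ability to assist in malfunction localization. The benchmark contains $3130$ entries, with $40.75$ dialogue turns per entry on average. We train an end-to-end model to generate vague information that simulates user behavior, and we introduce long-range rollback and recovery procedures to simulate user error scenarios, which makes it possible to assess a model's integrated capabilities in task tracking and error recovery. Experiments show that Gemini 2.5 pro achieves the best performance.

Abstract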

In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments, evaluating a model’s ability to assist in malfunction localization, which contains $3130$ entries and $40.75$ turns per entry on average. We train an end-to-end model to generate vague information to reflect user behavior and introduce long-range rollback and recovery procedures to simulate user error scenarios, enabling assessment of a model’s integrated capabilities in task tracking and error recovery, and Gemini 2.5 pro achieves the best performance.

Keywords: fault trees, large language models, malfunction tracking, benchmark evaluation, multi-turn dialogue, task tracking, error recovery, JFTA-Bench

80. ❌ Can Graph Foundation Models Generalize Over Architecture?

Authors: Benjamin Gutteridge, Michael Bronstein, Xiaowen Dong Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22984v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

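Scoring rationale: The paper studies architecture generalization in graph foundation models (GFMs), which is research on foundation models in a specific domain (graph neural networks). It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), because it directly studies the foundation model concept applied to graphs. It is moderately relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because GFMs have potential applications in scientific computing, bioinformatics, and related fields. The other keywords target specific language-model techniques (such as MoE, RLHF, and quantization) or specific capabilities (such as chain of thought and tool use) and have no direct relation to this paper on GNN architecture generalization, so they score 0.

!!! tip deepseek-chat TL;DR

This paper finds that existing graph foundation models fail to generalize to tasks with different architectural requirements because they use fixed architectures. It proposes a framework that adaptively discovers and mixes task-specific graph operators at inference time, achieving zero-shot generalization across tasks with heterogeneous architectural requirements.

Abstract translation

Graph foundation models (GFMs) have recently attracted considerable attention for their promise: built on graph neural network (GNN) architectures, they can generalize zero-shot across graphs of arbitrary scale, feature dimension, and domain. Although existing work has demonstrated this ability empirically on diverse real-world benchmarks, these tasks share a crucial hidden limitation: they admit only a narrow set of effective GNN architectures. In particular, current domain-agnostic GFMs rely on fixed architectural backbones, implicitly assuming that a single message-passing regime suffices for all tasks. This paper argues that architecture adaptivity is a necessary condition for true GFMs. We show that existing approaches are not robust to task-dependent architectural attributes and, as a case study, use "range" as a minimal and measurable axis along which this limitation becomes explicit. Through theoretical analysis and controlled synthetic experiments, we show that fixed-backbone GFMs provably under-reach on tasks whose architectural requirements differ from those seen during training. To address this, we propose a framework that adapts the effective GNN architecture at inference time by discovering and mixing task-specific linear graph operators, enabling zero-shot generalization to tasks with heterogeneous architectural requirements without retraining. We validate the approach on arbitrary-range synthetic tasks and a suite of real-world benchmarks, and the results show improved performance and robustness over existing domain-agnostic GFMs.

Abstract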

Graph foundation models (GFMs) have recently attracted interest due to the promise of graph neural network (GNN) architectures that generalize zero-shot across graphs of arbitrary scales, feature dimensions, and domains. While existing work has demonstrated this ability empirically across diverse real-world benchmarks, these tasks share a crucial hidden limitation: they admit a narrow set of effective GNN architectures. In particular, current domain-agnostic GFMs rely on fixed architectural backbones, implicitly assuming that a single message-passing regime suffices across tasks. In this paper, we argue that architecture adaptivity is a necessary requirement for true GFMs. We show that existing approaches are non-robust to task-dependent architectural attributes and, as a case study, use range as a minimal and measurable axis along which this limitation becomes explicit. With theoretical analysis and controlled synthetic experiments, we demonstrate that fixed-backbone GFMs provably under-reach on tasks whose architectural requirements differ from those seen at training time. To address this issue, we introduce a framework that adapts effective GNN architecture at inference time by discovering and mixing task-specific linear graph operators, enabling zero-shot generalization across tasks with heterogeneous architectural requirements, without retraining. We validate our approach on arbitrary-range synthetic tasks and a suite of real-world benchmarks, demonstrating improved performance and robustness over existing domain-agnostic GFMs.

Keywords: Graph Foundation Models, GNN architectures, zero-shot generalization, architecture adaptivity, message-passing, task-specific operators, domain-agnostic, inference-time adaptation
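
One way to picture "mixing linear graph operators" is as a weighted sum of powers of the adjacency matrix applied to node features. The sketch below is an illustration under stated assumptions, not the paper's framework: the operator family (identity, A, A squared, and so on), the hand-picked mixing weights, and the function names are hypothetical, and the paper's procedure for discovering operators at inference time is not reproduced. It only shows why the range of a task matters: a signal on node 0 of a three-node path reaches node 2 only when a 2-hop operator receives weight.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(a: Matrix, x: Sequence[float]) -> List[float]:
    """Dense matrix-vector product in plain Python."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def mixed_propagation(a: Matrix, x: Sequence[float],
                      weights: Sequence[float]) -> List[float]:
    """Return sum_k weights[k] * A^k x, where k is the hop count (range)."""
    out = [0.0] * len(x)
    hop_features = list(x)  # A^0 x
    for w in weights:
        out = [o + w * h for o, h in zip(out, hop_features)]
        hop_features = matvec(a, hop_features)  # advance one hop
    return out


# Path graph 0 - 1 - 2 with a unit signal on node 0.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
signal = [1.0, 0.0, 0.0]

print(mixed_propagation(adjacency, signal, [0.0, 1.0, 0.0]))  # 1-hop only
print(mixed_propagation(adjacency, signal, [0.0, 0.0, 1.0]))  # 2-hop only
```

A backbone fixed to the 1-hop weights cannot place any signal on node 2, which is a small-scale analogue of the under-reaching behaviour the abstract describes.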

81. ❌ DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube

Authors: Jawid Ahmad Baktash, Mosa Ebrahimi, Mohammad Zarif Joya, Mursal Dawodi Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

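Scoring rationale: The paper focuses on misinformation detection in Dari and uses BERT models for text classification. It does not involve large models, innovations in the principles of deep learning techniques, or AI applications in science. All keywords concern large-model techniques, deep learning innovation, or scientific AI applications, whereas this paper only applies conventional BERT models to a language-specific task, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

This paper creates DariMis, the first Dari-language misinformation detection dataset, and proposes a pair-input encoding strategy that improves the misinformation detection performance of BERT models on YouTube video titles and descriptions.

Abstract translation

Dari, the primary language of Afghanistan, has tens of millions of speakers, yet it has long been absent from misinformation detection research. To fill this gap, we present DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled along two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that the two dimensions are structurally coupled rather than independent: 55.9% of Misinformation carries at least Medium harm potential, compared with only 1.0% of True content. This allows Information Type classifiers to act as implicit harm-triage filters in content moderation pipelines.
We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal for identifying misleading information. An ablation study against single-field concatenation shows that, although overall macro F1 differs by only 0.09 percentage points, pair-input encoding raises recall on the safety-critical minority class (Misinformation) by 7.0 percentage points (from 60.1% to 67.1%). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base: ParsBERT achieves the best test performance, with 76.60% accuracy and 72.77% macro F1. Bootstrap 95% confidence intervals are reported for all metrics, and the practical significance and statistical limitations of the results are discussed.

Abstract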

Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal of misleading information. An ablation study against single-field concatenation shows that pair-input encoding yields a 7.0 percentage point gain in Misinformation recall (60.1 percent to 67.1 percent), the safety-critical minority class, despite modest overall macro F1 differences (0.09 percentage points). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base; ParsBERT achieves the best test performance with accuracy of 76.60 percent and macro F1 of 72.77 percent. Bootstrap 95 percent confidence intervals are reported for all metrics, and we discuss both the practical significance and statistical limitations of the results.

Keywords: Dari misinformation detection, YouTube videos, BERT model, pair-input encoding, harm level classification, ParsBERT, multilingual NLP, content moderation
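
The difference between pair-input encoding and single-field concatenation can be sketched with the standard BERT sentence-pair layout, `[CLS] A [SEP] B [SEP]`, where segment IDs mark which field each token belongs to. This is a minimal sketch rather than the authors' code: tokens are assumed to be pre-split strings, the function names are hypothetical, and real tokenizers (for example for XLM-RoBERTa, which uses different special tokens) would handle this step themselves.

```python
from typing import List, Tuple


def encode_pair(title: List[str], description: List[str]) -> Tuple[List[str], List[int]]:
    """BERT-style pair input: the title is segment 0, the description segment 1."""
    tokens = ["[CLS]"] + title + ["[SEP]"] + description + ["[SEP]"]
    segment_ids = [0] * (len(title) + 2) + [1] * (len(description) + 1)
    return tokens, segment_ids


def encode_concat(title: List[str], description: List[str]) -> Tuple[List[str], List[int]]:
    """Single-field baseline: both fields are merged into one segment."""
    tokens = ["[CLS]"] + title + description + ["[SEP]"]
    return tokens, [0] * len(tokens)


# Hypothetical placeholder tokens; real inputs would be Dari subword tokens.
pair_tokens, pair_segments = encode_pair(["t1", "t2"], ["d1"])
flat_tokens, flat_segments = encode_concat(["t1", "t2"], ["d1"])
print(pair_tokens, pair_segments)
print(flat_tokens, flat_segments)
```

In the pair layout the model receives an explicit boundary and segment signal between title and description, which is the structural difference the ablation in the abstract isolates.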

82. ❌ Where Experts Disagree, Models Fail: Detecting Implicit Legal Citations in French Court Decisions

Authors: Avrile Floro, Tamara Dhorasoo, Soline Pellez, Nils Holzenberger Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22973v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

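Scoring rationale: The paper studies the detection of implicit citations in legal text, an application of AI to law. It is moderately relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because legal analysis can be viewed as an application of AI in the social sciences. However, the paper does not involve large models, innovations in deep learning principles, or any of the specific techniques in the other keywords (such as LLMs, MoE, SFT, or RAG), and it does not use large models for experiments or theoretical discussion, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper studies how to detect implicit legal citations in French court decisions. It builds an expert-annotated benchmark dataset, finds that expert disagreement is associated with model failures, and develops supervised and unsupervised methods to improve detection performance.

Abstract translation

Computational methods applied to legal scholarship make it possible to analyze law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? Answering it requires distinguishing legal reasoning from semantic similarity. This study focuses on implicit citations of the French Civil Code in first-instance court decisions and introduces a benchmark of 1,015 passage-article pairs annotated by three legal experts. We find that expert disagreement predicts model failures. Inter-annotator agreement is moderate ($κ$ = 0.33), and 43% of disagreements concern the boundary between factual description and legal reasoning. Our supervised ensemble achieves an F1 score of 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall in the 33% of cases on which the annotators disagreed. Despite these limits, reframing the task as top-k ranking and using multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to reveal legally ambiguous applications rather than obvious errors.

Abstract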

Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate ($κ$ = 0.33) with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.

Keywords: legal citation detection, implicit citation, French court decisions, expert disagreement, benchmark dataset, supervised ensemble, unsupervised ranking, computational legal analysis
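
The "precision at k" figure quoted in the abstract is the fraction of relevant items among the k highest-scored candidates. The helper below is a generic sketch of that metric, assuming each passage-article pair carries a model score and a binary relevance label; the name `precision_at_k` and the toy data are hypothetical, and the paper's multi-model consensus scoring is not reproduced.

```python
from typing import List, Tuple


def precision_at_k(scored_pairs: List[Tuple[float, bool]], k: int) -> float:
    """Fraction of relevant items among the k highest-scored pairs."""
    if k <= 0 or not scored_pairs:
        raise ValueError("k must be positive and scored_pairs non-empty")
    # Sort by score only, highest first, then keep the top k.
    top = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(1 for _, relevant in top if relevant) / len(top)


# Toy example: (score, is_implicit_citation)
pairs = [(0.9, True), (0.8, False), (0.7, True), (0.1, True)]
print(precision_at_k(pairs, 2))
print(precision_at_k(pairs, 3))
```

Ranking and reading only the top k trades recall for precision, which suits a setting where experts review a short list rather than every candidate pair.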

83. ❌ Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees

Authors: Ye Li, Anqi Hu, Yuanchang Ye, Shiyan Tong, Zhiyuan Wang, Bo Fu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

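Scoring rationale: The paper focuses on a set-valued prediction framework for large language models (LLMs) and directly concerns a core technical use of LLMs. It proposes moving from point prediction to set prediction and establishes feasibility-aware coverage guarantees, a technical contribution to LLM generation and evaluation methods. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, agent systems, and scientific AI applications, are not covered in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a set-valued prediction framework for large language models. A data-driven calibration procedure builds prediction sets that contain a correct answer with the desired probability whenever the risk level is feasible, and experiments on six language generation tasks support its statistical validity and predictive efficiency.

Abstract translation

Large language models (LLMs) inherently operate over a vast generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model's capability: although the top-ranked response may be incorrect, valid answers may still exist in the broader output space and can potentially be found through repeated sampling. This observation motivates a shift from point prediction to set-valued prediction, in which the model produces a set of candidate responses rather than a single MLG. This paper proposes a principled framework for set-valued prediction that provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samples, an LLM may fail to produce an acceptable response within the sampled candidate set for certain questions. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with the desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs validate the statistical validity and predictive efficiency of the framework.

Abstract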

Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model’s capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.

Keywords: Large Language Models, Set-valued Prediction, Coverage Guarantees, Feasibility-aware, Statistical Validity, Prediction Sets, Calibration Procedure, Language Generation Tasks
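
The feasibility idea in the abstract can be made concrete with a simplified empirical sketch. It is not the paper's calibration procedure: it assumes each sampled response has a confidence score and a correctness label on a calibration set, it uses plain empirical risk with no finite-sample correction, and all function names are hypothetical. The minimum achievable risk is estimated as the share of calibration questions with no correct sampled response, and a threshold is returned only when the target risk `alpha` is at least that level.

```python
from typing import List, Optional, Tuple

# One calibration question = list of (confidence score, is_correct) samples.
Question = List[Tuple[float, bool]]


def min_risk_level(calib: List[Question]) -> float:
    """Share of questions with no correct sample: no threshold can cover them."""
    missed = sum(1 for cands in calib if not any(ok for _, ok in cands))
    return missed / len(calib)


def empirical_risk(calib: List[Question], threshold: float) -> float:
    """Share of questions whose set {score >= threshold} has no correct answer."""
    missed = sum(
        1 for cands in calib
        if not any(ok and score >= threshold for score, ok in cands)
    )
    return missed / len(calib)


def calibrate_threshold(calib: List[Question], alpha: float) -> Optional[float]:
    """Largest threshold with empirical risk <= alpha, or None if infeasible."""
    if alpha < min_risk_level(calib):
        return None  # target risk is below the minimum achievable level
    thresholds = sorted({s for cands in calib for s, _ in cands}, reverse=True)
    for t in thresholds:  # higher thresholds give smaller prediction sets
        if empirical_risk(calib, t) <= alpha:
            return t
    return None


calib_data = [
    [(0.9, True), (0.2, False)],
    [(0.8, False), (0.5, True)],
    [(0.7, False), (0.3, False)],  # no correct sample at all
    [(0.6, True)],
]
print(min_risk_level(calib_data))
print(calibrate_threshold(calib_data, 0.10))
print(calibrate_threshold(calib_data, 0.25))
```

Returning `None` for an infeasible target, instead of silently producing a threshold that cannot meet it, is the behaviour that a feasibility-aware guarantee is meant to make explicit.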

84. ❌ PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference

Authors: Qirui Wang, Qi Guo, Yiding Sun, Junkai Yang, Dongxu Zhang, Shanmin Pang, Qing Guo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22943v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

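Scoring rationale: The paper studies efficient inference serving for personalized diffusion models. Its main contributions are: (1) LLM-based reranking (relevant to the LLM keyword); (2) post-training quantization (highly relevant to the Post-training and Quantization keywords); (3) a hybrid retrieval method (relevant to RAG); and (4) a focus on inference efficiency (relevant to Inference Acceleration). The paper does not cover MoE, SLMs, Scaling Laws, Alignment, PEFT, Context Extension, Reasoning, Agents, or Science.

!!! tip deepseek-chat TL;DR

The paper proposes the PersonalQ framework, which uses intent-aligned checkpoint selection and trigger-aware quantization to address intent misrouting and quantization distortion when serving personalized diffusion models, enabling efficient, high-fidelity inference serving.

Abstract (Chinese translation)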

个性化文生图生成技术允许用户将扩散模型微调为特定概念的检查点库，但高效部署这些库存在两大挑战：自然语言请求常存在歧义，可能被误路由至视觉相似的检查点；而标准的训练后量化可能破坏编码个性化概念的脆弱表征。本文提出PersonalQ框架，通过检查点的触发令牌这一共享信号，将检查点选择与量化过程统一起来。检查点选择模块通过结合意图感知混合检索与基于大语言模型的上下文重排序，实现意图对齐的选择，仅当多个意图均可能时提出简明的澄清问题，随后通过插入选定检查点的规范触发词重写提示。与此互补，触发感知量化在交叉注意力层应用触发感知混合精度，保留触发条件相关的键/值行（及其注意力权重），同时对其余路径进行激进量化以实现内存高效推理。实验表明，PersonalQ在意图对齐方面优于检索与重排序基线方法，而触发感知量化相比现有扩散模型训练后量化方法，始终提供更优的压缩-质量权衡，使得个性化检查点的可扩展部署无需牺牲生成保真度。

Abstract

Personalized text-to-image generation lets users fine-tune diffusion models into repositories of concept-specific checkpoints, but serving these repositories efficiently is difficult for two reasons: natural-language requests are often ambiguous and can be misrouted to visually similar checkpoints, and standard post-training quantization can distort the fragile representations that encode personalized concepts. We present PersonalQ, a unified framework that connects checkpoint selection and quantization through a shared signal – the checkpoint’s trigger token. Check-in performs intent-aligned selection by combining intent-aware hybrid retrieval with LLM-based reranking over checkpoint context and asks a brief clarification question only when multiple intents remain plausible; it then rewrites the prompt by inserting the selected checkpoint’s canonical trigger. Complementing this, Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows (and their attention weights) while aggressively quantizing the remaining pathways for memory-efficient inference. Experiments show that PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ consistently offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.

Keywords: Personalized Diffusion Models, Efficient Inference, Checkpoint Selection, Post-training Quantization, Trigger-Aware Quantization, Intent Alignment, Memory-efficient Inference, Text-to-Image Generation
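
The mixed-precision idea behind Trigger-Aware Quantization, as described in the abstract, can be sketched as row-wise quantization that skips a chosen set of rows. The sketch below is an illustration only, not the paper's implementation: the symmetric per-row quantizer, the 4-bit default, the function names, and the way trigger rows are identified are all assumptions.

```python
def quantize_row(row, bits):
    """Symmetric uniform fake-quantization: quantize a row, then dequantize it."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for signed 4-bit
    max_abs = max(abs(v) for v in row)
    if max_abs == 0:
        return list(row)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in row]


def mixed_precision_rows(matrix, trigger_rows, low_bits=4):
    """Keep rows listed in trigger_rows untouched; quantize every other row.

    matrix: list of rows, standing in for a cross-attention key or value matrix.
    trigger_rows: set of row indices assumed to correspond to the trigger token.
    """
    return [
        list(row) if i in trigger_rows else quantize_row(row, low_bits)
        for i, row in enumerate(matrix)
    ]


# Toy key matrix with three token rows; row 1 is treated as the trigger row.
keys = [
    [0.10, -0.70, 0.35],
    [0.123, 0.456, -0.789],
    [0.02, 0.04, -0.06],
]
for row in mixed_precision_rows(keys, trigger_rows={1}):
    print(row)
```

The trigger row passes through unchanged, while each remaining row is rounded to a 4-bit grid with an error of at most half a quantization step per entry.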

85. ❌ Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning

Authors: Anshul Solanki, Sanchit Latawa, Koushik Chakraborty, Navneet Kamboj Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

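Scoring rationale: The paper compares fine-tuning of large and small language models on the NL2SQL task and finds that small models improve markedly with fine-tuning and CoT enrichment. Highly relevant keywords are Small Language Models (the main object of study), Large Language Models (the comparison baseline), Post-training/SFT (the main method), and Chain of Thought (the key enhancement technique). Scaling Laws and System 2 Thinking are partially relevant because the paper discusses a scaling phenomenon relating model size to performance, as well as reasoning patterns. Other keywords such as MoE, RLHF, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The paper studies fine-tuning of large and small language models on the NL2SQL task and finds that small models gain substantial accuracy from fine-tuning and Chain-of-Thought enrichment, balancing cost-effectiveness against performance.

Abstract (Chinese translation)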

将自然语言转换为SQL（NL2SQL）仍然是企业数据民主化的关键瓶颈。尽管Gemini 2.5等大型语言模型（LLMs）已展现出令人印象深刻的零样本能力，但其高昂的推理成本限制了大规模部署。本文探讨了在NL2SQL任务上对大、小型语言模型进行微调的有效性。我们的研究揭示了一种反直觉的缩放现象。在标准数据集上微调大型模型（Gemini 2.5 Flash/Lite）带来的收益微乎其微，且常导致对复杂查询的过拟合。相反，小型模型（如Qwen）则表现出显著提升。微调使小型模型的基线准确率从36%提升至45%，而通过引入显式的思维链（Chain-of-Thought, CoT）推理进一步丰富数据集后，准确率跃升至54.5%（图2）。尽管这仍低于Gemini 2.5等大型模型的准确率，但它确实实现了显著降低成本、降低推理延迟的商业目标，并满足了业务关键的性能准确率阈值。本文证明，通过迁移推理模式，计算高效的小型模型能够接近生产级性能。

Abstract

Translating Natural Language to SQL (NL2SQL) remains a critical bottleneck for democratization of data in enterprises. Although Large Language Models (LLMs) such as Gemini 2.5 have demonstrated impressive zero-shot capabilities, their high inference costs limit deployment at scale. This paper explores the efficacy of fine-tuning both large and small language models on NL2SQL tasks. Our research reveals a counter-intuitive scaling phenomenon. Fine-tuning large models (Gemini 2.5 Flash/Lite) on standard datasets yields negligible returns, often leading to overfitting on complex queries. Conversely, small models (Qwen) show significant gains. Fine-tuning improved the small model baseline from 36% to 45%, and further enriching the dataset with explicit Chain-of-Thought (CoT) reasoning raised accuracy to 54.5% (Fig. 2). While this is still lower than the accuracy of large models like Gemini 2.5, it serves the business goals of significantly reducing cost and inference latency while still meeting the business-critical accuracy threshold. This paper demonstrates that transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance.

Keywords: Small Language Models, Large Language Models, Fine-tuning, NL2SQL, Chain-of-Thought, Inference Cost, Scaling Phenomenon, Accuracy Improvement

86. ❌ Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Authors: Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

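Scoring rationale: The paper's core contribution is Ran Score, an LLM-based evaluation framework for radiology reports, which is an applied AI innovation in medical imaging. It is highly relevant to "Large Language Models" (10 points) because it explicitly uses LLMs for multi-label extraction and evaluation of radiology reports, and highly relevant to "AI for Science" (10 points) because it is applied AI research in the biomedical/radiology domain. The other keywords mainly concern large-model technical principles, training methods, and inference optimization, which the paper does not address in detail, so they are scored 0.

!!! tip deepseek-chat TL;DR

The study develops a framework that combines clinician expertise with large language models to extract multi-label findings from free-text chest X-ray reports, and defines Ran Score, a finding-level evaluation metric that significantly improves the accuracy of evaluating radiology report generation.

Abstract (Chinese translation)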

胸部X光报告生成与自动化评估目前受限于对低发病率异常的识别能力不足，以及对否定性和模糊性等临床关键语言的处理不充分。我们开发了一个结合临床专家知识与大语言模型的医师引导框架，用于从自由文本胸部X光报告中提取多标签发现，并据此定义了Ran Score——一种针对报告评估的发现级度量指标。利用来自公共胸部X光数据集的三个非重叠MIMIC-CXR-EN队列以及独立的ChestX-CN验证队列，我们优化了提示词，建立了放射科医师衍生的参考标签，并评估了报告生成模型。优化后的框架在MIMIC-CXR-EN开发队列上将宏观平均分数从0.753提升至0.956，在直接可比标签上超过CheXbert基准15.7个百分点，并在ChestX-CN验证队列上展现出稳健的泛化能力。本研究表明：医师引导的提示词优化能提升与放射科医师参考标准的一致性，且Ran Score能够实现对报告保真度的发现级评估，尤其针对低发病率异常。

Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

Keywords: Large Language Models, Radiology Report Generation, Evaluation Metric, Chest X-ray, Clinician-guided Framework, Multi-label Finding Extraction, Ran Score, Medical AI
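
The abstract reports a macro-averaged score over findings but does not say which per-finding score is averaged. As a rough illustration of why macro-averaging matters for low-prevalence abnormalities, the sketch below assumes per-finding F1 and compares macro and micro averages; the finding names and counts are made up, and none of this is the paper's definition of Ran Score.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Average the per-finding F1 scores so every finding weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """Pool the counts across findings first, so common findings dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts: one common finding, one rare finding.
counts = {
    "pleural effusion": (90, 10, 10),
    "pneumothorax": (1, 1, 3),
}
print(round(macro_f1(counts), 3))
print(round(micro_f1(counts), 3))
```

In this toy example the poorly recognized rare finding pulls the macro average well below the micro average, which is the behaviour that makes a finding-level, macro-averaged view sensitive to low-prevalence abnormalities.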

87. ❌ ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

Authors: Xiangyu Yin, Yi Qi, Chih-hong Cheng Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22934v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

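Scoring rationale: The paper studies defence mechanisms for RAG systems, so it is highly relevant to "Retrieval-Augmented Generation" (10 points). It concerns the reliability of LLM applications, which makes it relevant to "Large Language Models" (8 points), and its defence goals include improving factuality, which gives partial relevance to "Hallucination Mitigation" (5 points). Other keywords such as MoE, SLMs, training methods, reasoning techniques, and AI for Science are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ProGRank, a post hoc reranking method that defends dense-retriever RAG systems against corpus poisoning attacks, showing stronger defence performance and a better robustness-utility trade-off across multiple datasets and attack scenarios.

Abstract (Chinese translation)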

检索增强生成（Retrieval-Augmented Generation，RAG）通过将生成过程建立在检索证据的基础上，提升了大型语言模型应用的可靠性，但同时也引入了一个新的攻击面：语料库投毒。在此场景下，攻击者通过注入或编辑文本段落，使其在针对目标查询的Top-$K$结果中排名靠前，进而影响下游生成。现有的语料库投毒防御方法通常依赖于内容过滤、辅助模型或生成端推理，这可能增加部署难度。我们提出ProGRank，一种面向密集检索器RAG的事后、免训练的检索器端防御方法。ProGRank在温和的随机扰动下对每个查询-段落对进行压力测试，并从检索器的一个固定小参数子集中提取探针梯度。基于这些信号，它推导出两个不稳定性指标——表征一致性和分散风险，并在重排序步骤中将其与分数门控相结合。ProGRank保留了原始段落内容，无需重新训练，且在部署的检索器不可用时也支持基于代理的变体。我们在三个数据集、三种密集检索器骨干网络、代表性语料库投毒攻击以及检索阶段和端到端设置下进行了大量实验，结果表明ProGRank提供了更强的防御性能和更优的鲁棒性-效用平衡。在自适应规避攻击下，该方法仍保持竞争力。

Abstract

Retrieval-Augmented Generation (RAG) improves the reliability of large language model applications by grounding generation in retrieved evidence, but it also introduces a new attack surface: corpus poisoning. In this setting, an adversary injects or edits passages so that they are ranked into the Top-$K$ results for target queries and then affect downstream generation. Existing defences against corpus poisoning often rely on content filtering, auxiliary models, or generator-side reasoning, which can make deployment more difficult. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query–passage pair under mild randomized perturbations and extracts probe gradients from a small fixed parameter subset of the retriever. From these signals, it derives two instability signals, representational consistency and dispersion risk, and combines them with a score gate in a reranking step. ProGRank preserves the original passage content, requires no retraining, and also supports a surrogate-based variant when the deployed retriever is unavailable. Extensive experiments across three datasets, three dense retriever backbones, representative corpus poisoning attacks, and both retrieval-stage and end-to-end settings show that ProGRank provides stronger defence performance and a favorable robustness–utility trade-off. It also remains competitive under adaptive evasive attacks.

Keywords: Retrieval-Augmented Generation, RAG, corpus poisoning, dense retriever, defense mechanism, reranking, robustness, query-passage pair

88. ❌ The EU AI Act and the Rights-based Approach to Technological Governance

Authors: Georgios Pavlidis Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

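Scoring rationale: The paper "The EU AI Act and the Rights-based Approach to Technological Governance" is a legal and policy analysis of the EU AI Act. It examines how the Act places fundamental rights at the centre of a risk-based governance framework and how it institutionalises a human-rights-centric approach. The content focuses entirely on AI regulation, legal frameworks, rights protection, and governance mechanisms, and does not touch on the technical principles, architectures, training methods, inference techniques, optimization techniques, or specific scientific applications of large models or deep learning. All scoring keywords relate directly to technical innovation in, or application of, large models and deep learning, whereas this paper belongs to legal and policy research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyses how the EU AI Act embeds fundamental rights as legal thresholds and procedural triggers in a risk-governance framework spanning the full lifecycle of AI systems, and discusses the Act's potential as a model for rights-preserving AI systems along with its implementation challenges.

Abstract (Chinese translation)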

《欧盟人工智能法案》是塑造欧盟数字监管架构的重要进展。该法案将基本权利置于基于风险的治理框架核心。本文探讨了《人工智能法案》如何将"以人为本"的人工智能方法制度化，并分析了法案条款如何以显性和隐性方式嵌入对《欧盟基本权利宪章》所载权利的保护。文章认为，基本权利不仅作为理想目标存在，更在人工智能系统全生命周期中发挥着法律门槛与程序触发机制的作用。分析表明，《人工智能法案》有望成为构建权利保护型人工智能系统的范本，同时也承认其在实施层面将面临诸多挑战。

Abstract

The EU AI Act constitutes an important development in shaping the Union’s digital regulatory architecture. The Act places fundamental rights at the heart of a risk-based governance framework. The article examines how the AI Act institutionalises a human-centric approach to AI and how the AI Act’s provisions explicitly and implicitly embed the protection of rights enshrined in the EU Charter of Fundamental Rights. It argues that fundamental rights function not merely as aspirational goals, but as legal thresholds and procedural triggers across the lifecycle of an AI system. The analysis suggests that the AI Act has the potential to serve as a model for rights-preserving AI systems, while acknowledging that challenges will emerge at the level of implementation.

Keywords: EU AI Act, technological governance, fundamental rights, risk-based framework, human-centric AI, legal thresholds, rights-preserving AI systems, implementation challenges

89. ❌ EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

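Scoring rationale: The paper proposes the EVA framework, which builds a video-understanding agent on multimodal large language models (MLLMs). It centrally involves LLMs (10 points), agents (10 points), reasoning methods (CoT and System 2 Thinking, 10 points each), self-reflection (10 points), and supervised fine-tuning in the training pipeline (SFT, 10 points). Other keywords such as MoE, quantization, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the long token sequences and complex dependencies in long-video understanding, the paper proposes the EVA framework, which combines iterative planning-before-perception reasoning with reinforcement-learning training and achieves 6-12% improvements over general MLLM baselines on multiple benchmarks.

Abstract (Chinese translation)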

由于视频包含长令牌序列、复杂的时间依赖性和大量冗余帧，基于多模态大语言模型（MLLMs）的视频理解仍面临挑战。现有方法通常将MLLMs视为被动识别器，直接处理完整视频或均匀采样的帧，缺乏自适应推理能力。近期基于智能体（agent）的方法引入了外部工具，但仍依赖人工设计的工作流程和“先感知后决策”的策略，导致处理长视频时效率低下。本文提出EVA（端到端视频智能体高效强化学习框架），通过迭代式的“摘要-规划-行动-反思”推理机制，实现“先规划后感知”的范式。EVA能够自主决定观看内容、观看时机及观看方式，达成查询驱动的高效视频理解。为训练此类智能体，我们设计了一个简洁高效的三阶段学习流程——包括监督微调（SFT）、卡尼曼-特沃斯基优化（KTO）和广义奖励策略优化（GRPO）——以衔接监督模仿学习与强化学习。我们进一步为每个阶段构建了高质量数据集，支持稳定且可复现的训练。我们在六个视频理解基准上评估EVA，验证了其综合能力。与现有基线相比，EVA在通用MLLM基线上实现了6-12%的显著提升，并在已有自适应智能体方法基础上进一步获得1-3%的性能增益。代码与模型已开源：https://github.com/wangruohui/EfficientVideoAgent。

Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.

Keywords: Multimodal Large Language Models, Video Understanding, Reinforcement Learning, Autonomous Agents, Supervised Fine-tuning, Chain of Thought, Self-Reflection, Efficient Video Agent

90. ❌ From the AI Act to a European AI Agency: Completing the Union’s Regulatory Architecture

Authors: Georgios Pavlidis Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

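Scoring rationale: The paper discusses EU AI regulation and governance architecture (the AI Act, the European AI Office, and the establishment of regulatory bodies). It belongs to AI policy, governance, and regulation rather than research on the technical principles, innovations, or specific applications of large models or deep learning. It does not touch on any technical topic among the scoring keywords (such as model architecture, training methods, inference optimization, alignment, or applications), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper examines whether, and how, a stronger supranational AI regulator (such as a European AI Agency) should be established under the EU AI Act framework to improve policy coherence, risk-assessment capacity, and international cooperation, in service of the EU's strategic goal of digital and technological sovereignty.

Abstract (Chinese translation)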

随着人工智能（AI）技术的持续发展，为确保人工智能的开发与部署符合伦理原则，同时保持创新活力与经济竞争力，有效的风险评估、监管与监督机制显得尤为必要。欧盟《人工智能法案》的通过标志着这一方向上的重要进展，该法案建立了统一的法规框架，其中包含对人工智能治理的详细规定，并设立了欧洲人工智能办公室。本文重新审视了是否仍有必要建立一个更强大的跨国人工智能专门机构，并探讨了此类机构如何能够提升政策协调性、增强风险评估能力并促进国际合作。文章同时指出，一个在欧盟层面得到强化的机构也将有助于实现欧盟确保数字与技术主权的战略目标。

Abstract

As artificial intelligence (AI) technologies continue to advance, effective risk assessment, regulation, and oversight are necessary to ensure that AI development and deployment align with ethical principles while preserving innovation and economic competitiveness. The adoption of the EU AI Act marks an important step in this direction, establishing a harmonised legal framework that includes detailed provisions on AI governance, as well as the creation of the European AI Office. This paper revisits the question of whether a more robust supranational agency dedicated to AI is still warranted and explores how such a body could enhance policy coherence, improve risk assessment capacities, and foster international cooperation. It also argues that a strengthened EU-level agency would serve the Union’s strategic aim of securing digital and technological sovereignty.

Keywords: AI regulation, EU AI Act, European AI Office, governance, risk assessment, policy coherence, international cooperation, digital sovereignty

91. ❌ Off-Policy Evaluation and Learning for Survival Outcomes under Censoring

Authors: Kohsuke Kubota, Mitsuhiro Takahashi, Yuta Saito Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22900v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

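Scoring rationale: The paper focuses on off-policy evaluation and learning in survival analysis, proposing the IPCW-IPS and IPCW-DR methods to handle right-censored data. It falls under causal inference and machine learning applied to medical and business decision-making, but does not involve large models, deep learning principles, AI for Science, or any of the other keywords. All keywords relate directly to large-model techniques, scientific applications of AI, or related methodological innovations, whereas this paper applies classical statistical machine learning methods to a specific problem, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

For off-policy evaluation and learning with right-censored survival data, the paper proposes the IPCW-IPS and IPCW-DR methods based on inverse probability of censoring weighting, proves their unbiasedness and double robustness theoretically, and validates them on simulated and real-world data.

Abstract (Chinese translation)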

优化生存结局（如患者生存率或客户留存率）是数据驱动决策中的关键目标。离策略评估（Off-Policy Evaluation，OPE）提供了一个强大的框架，仅使用记录数据即可评估此类决策策略，无需在高风险应用中进行成本高昂或具有风险的在线实验。然而，传统估计器并未针对右删失生存结局设计，因其忽略了删失时间后未观测到的生存时间，导致对真实策略性能的系统性低估。为解决这一问题，我们提出了一种适用于删失条件下生存结局的离策略评估与离策略学习（Off-Policy Learning，OPL）新框架。具体而言，我们引入了IPCW-IPS与IPCW-DR方法，其采用逆概率删失加权（Inverse Probability of Censoring Weighting）技术以显式处理删失偏差。我们从理论上证明了所提估计量的无偏性，且IPCW-DR具备双重稳健性——只要倾向得分模型或结局模型之一设定正确，即可保证估计的一致性。进一步，我们将此框架扩展至约束性离策略学习，以在预算约束下优化策略价值。通过模拟研究，我们验证了所提方法的有效性，并利用公开真实世界数据在评估与学习任务中展示了其实际应用价值。

Abstract

Optimizing survival outcomes, such as patient survival or customer retention, is a critical objective in data-driven decision-making. Off-Policy Evaluation (OPE) provides a powerful framework for assessing such decision-making policies using logged data alone, without the need for costly or risky online experiments in high-stakes applications. However, typical estimators are not designed to handle right-censored survival outcomes, as they ignore unobserved survival times beyond the censoring time, leading to systematic underestimation of the true policy performance. To address this issue, we propose a novel framework for OPE and Off-Policy Learning (OPL) tailored for survival outcomes under censoring. Specifically, we introduce IPCW-IPS and IPCW-DR, which employ the Inverse Probability of Censoring Weighting technique to explicitly deal with censoring bias. We theoretically establish that our estimators are unbiased and that IPCW-DR achieves double robustness, ensuring consistency if either the propensity score or the outcome model is correct. Furthermore, we extend this framework to constrained OPL to optimize policy value under budget constraints. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical impacts using public real-world data for both evaluation and learning tasks.

Keywords: Off-Policy Evaluation, Survival Outcomes, Censoring, Inverse Probability of Censoring Weighting, IPCW-IPS, IPCW-DR, Double Robustness, Constrained Off-Policy Learning
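
The abstract above gives no formulas, so the following is only a minimal sketch of how inverse probability of censoring weighting can be combined with an inverse propensity score (IPS) estimator. It assumes the estimate is the sample average of `(pi_e / pi_b) * (event / G(T)) * T`, where `G` is the censoring survival function. The function name `ipcw_ips`, its arguments, and this exact form are illustrative assumptions and may differ from the paper's IPCW-IPS definition.

```python
def ipcw_ips(times, events, pi_e, pi_b, censor_surv):
    """Hypothetical IPCW-weighted IPS estimate of mean survival time.

    times       : observed time (event or censoring time) per logged sample
    events      : 1 if the event was observed, 0 if the sample was censored
    pi_e, pi_b  : evaluation / logging policy probability of the logged action
    censor_surv : G(T_i), probability of still being uncensored at times[i]
                  (assumed to be known or estimated elsewhere)
    """
    total = 0.0
    for t, d, pe, pb, g in zip(times, events, pi_e, pi_b, censor_surv):
        policy_weight = pe / pb   # corrects for the policy mismatch
        censor_weight = d / g     # censored samples contribute zero
        total += policy_weight * censor_weight * t
    return total / len(times)


# With no censoring and identical policies the estimate is a plain mean.
print(ipcw_ips([2, 4, 6], [1, 1, 1], [0.5] * 3, [0.5] * 3, [1.0] * 3))
```

Uncensored samples with a low `G(T)` are up-weighted to stand in for similar samples that were censored, which is what counters the underestimation described in the abstract.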

92. ❌ Confidence Calibration under Ambiguous Ground Truth

Authors: Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

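Scoring rationale: This paper studies confidence calibration, specifically calibration in multi-annotator settings where annotators genuinely disagree. Its core contribution is a set of ambiguity-aware post-hoc calibrators (Dirichlet-Soft, MCTS S=1, LS-TS) that optimise scoring rules against the full annotation distribution; MCTS here stands for Monte Carlo Temperature Scaling, not Monte Carlo Tree Search. Although the paper concerns confidence calibration of machine learning models such as classifiers, it has no direct connection to the keyword list, which centres on large language models, deep learning techniques, and AI for Science. It does not discuss LLMs, MoE, scaling laws, pre-training, alignment, RAG, attention mechanisms, reasoning methods, agents, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper addresses the problem that, when annotators genuinely disagree, conventional confidence calibration methods such as Temperature Scaling underestimate annotation uncertainty. It proposes a family of ambiguity-aware post-hoc calibrators that need no model retraining and that substantially reduce true-label expected calibration error on several benchmarks.

Abstract translation

Confidence calibration usually assumes that every input has a unique ground-truth label, but this assumption breaks down when annotators substantially disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label target in practice, can look well calibrated under conventional evaluation while remaining markedly miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling favours temperatures that underestimate annotator uncertainty, and its true-label miscalibration grows monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and need no model retraining. The methods cover progressively weaker annotation requirements. Dirichlet-Soft uses the full annotator distribution and achieves the best overall calibration quality in all settings. Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration on all benchmarks, showing that pre-aggregated label distributions are not required. Label-Smooth Temperature Scaling (LS-TS) works with voted labels alone, building data-driven pseudo-soft targets from the model's own confidence. On four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically informed synthetic annotations (ISIC 2019, DermaMNIST), Dirichlet-Soft reduces true-label expected calibration error by 55-87% relative to Temperature Scaling, while LS-TS reduces it by 9-77% without any annotator data.

Abstract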

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model’s own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.

Keywords: Confidence Calibration, Ambiguous Ground Truth, Annotator Disagreement, Temperature Scaling, Post-hoc Calibrators, Label Distribution, Expected Calibration Error (ECE), Multi-annotator
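
To make the claimed bias concrete, here is a small sketch that fits one temperature by grid search, first against a majority-vote label and then against an annotator distribution. It is plain temperature scaling with soft targets, used only as a stand-in; it is not the paper's Dirichlet-Soft, MCTS, or LS-TS. The helper names, the grid, and the toy logits are assumptions.

```python
import math


def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    # Works for one-hot and soft targets; zero-weight classes are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def fit_temperature(logits_list, targets):
    grid = [k / 10 for k in range(1, 101)]   # candidate temperatures 0.1 .. 10.0

    def loss(temp):
        return sum(cross_entropy(t, softmax(z, temp))
                   for z, t in zip(logits_list, targets))

    return min(grid, key=loss)


logits = [[2.0, 0.0]]                              # one confident prediction
t_hard = fit_temperature(logits, [[1.0, 0.0]])     # majority-vote label
t_soft = fit_temperature(logits, [[0.6, 0.4]])     # 60/40 annotator split
print(t_hard, t_soft)
```

On this toy input the hard label drives the temperature to the bottom of the grid, while the 60/40 split asks for a much higher temperature, in line with the direction of the bias described in the abstract.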

93. ❌ Continuous Optimization for Satisfiability Modulo Theories on Linear Real Arithmetic

Authors: Yunuo Cen, Daniel Ebler, Xuanyao Fong Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

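Scoring rationale: This paper studies continuous optimization methods for satisfiability modulo theories (SMT), specifically over linear real arithmetic. Its core contribution is the FourierSMT framework, which uses an extended Walsh-Fourier expansion and extended binary decision diagrams to turn discrete constraint problems into continuous optimization problems, making them highly parallelizable and amenable to GPU acceleration. Although the paper concerns optimization algorithms and computational efficiency, every keyword targets large models, deep learning, and related techniques (training methods, inference optimization, alignment, agents, and so on), none of which the paper touches. It belongs to formal methods and constraint solving and has no direct connection to the given keywords.

!!! tip deepseek-chat TL;DR

The paper proposes FourierSMT, a continuous-variable optimization framework for satisfiability modulo theories (SMT). It converts discrete constraints into a continuous optimization problem through an extended Walsh-Fourier expansion and extended binary decision diagrams, achieves an 8-fold speedup over state-of-the-art SMT solvers, and paves the way for GPU-based solving.

Abstract translation

Efficient SMT solving is essential in industrial applications such as hardware verification and design automation. Existing approaches rely mainly on conflict-driven clause learning, which is structurally hard to parallelize and therefore scales poorly. We present FourierSMT, a scalable and highly parallelizable continuous-variable optimization framework for SMT. We generalize the Walsh-Fourier expansion (WFE) from the Boolean domain to a mixed Boolean-real domain, calling the result the extended WFE (xWFE), which makes gradient methods applicable to SMT. This addresses the difficulty of finding satisfying assignments to high-arity constraints through local updates of discrete variables. To reduce the cost of evaluating the xWFE, we introduce the extended binary decision diagram (xBDD) and map the constraints in the xWFE onto xBDDs. We then show that sampling the circuit-output probability (COP) of an xBDD under randomized rounding is equivalent to computing the expectation of the xWFE, which allows the constraints to be evaluated efficiently. We prove that the reduced problem is guaranteed to converge and preserves satisfiability, which ensures soundness of the solutions. On large-scale scheduling and placement problems with up to 10,000 variables and 700,000 constraints, the framework achieves an 8-fold speedup over state-of-the-art SMT solvers. These results open the way to GPU-based optimization of SMT with continuous systems.

Abstract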

Efficient solutions for satisfiability modulo theories (SMT) are integral in industrial applications such as hardware verification and design automation. Existing approaches are predominantly based on conflict-driven clause learning, which is structurally difficult to parallelize and therefore scales poorly. In this work, we introduce FourierSMT as a scalable and highly parallelizable continuous-variable optimization framework for SMT. We generalize the Walsh-Fourier expansion (WFE), called extended WFE (xWFE), from the Boolean domain to a mixed Boolean-real domain, which allows the use of gradient methods for SMT. This addresses the challenge of finding satisfying variable assignments to high-arity constraints by local updates of discrete variables. To reduce the evaluation complexity of xWFE, we present the extended binary decision diagram (xBDD) and map the constraints from xWFE to xBDDs. We then show that sampling the circuit-output probability (COP) of xBDDs under randomized rounding is equivalent to the expectation value of the xWFEs. This allows for efficient computation of the constraints. We show that the reduced problem is guaranteed to converge and preserves satisfiability, ensuring the soundness of the solutions. The framework is benchmarked for large-scale scheduling and placement problems with up to 10,000 variables and 700,000 constraints, achieving 8-fold speedups compared to state-of-the-art SMT solvers. These results pave the way for GPU-based optimization of SMTs with continuous systems.

Keywords: Satisfiability Modulo Theories, Continuous Optimization, FourierSMT, Walsh-Fourier Expansion, Binary Decision Diagram, Linear Real Arithmetic, GPU Acceleration, Constraint Solving
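
The abstract builds on the Walsh-Fourier expansion of Boolean constraints. As a hedged illustration of the standard Boolean expansion only (not the paper's xWFE or xBDD construction), the sketch below computes the coefficients of a two-variable OR clause over `{-1, +1}` and evaluates the resulting multilinear polynomial at a relaxed, non-Boolean point. The function names and the `+1 = true` convention are assumptions.

```python
from itertools import combinations, product
from math import prod


def walsh_fourier_coefficients(f, n):
    """Coefficient for each subset S: the average of f(x) * prod_{i in S} x_i."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            coeffs[subset] = total / len(points)
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial; x may be any real-valued point."""
    return sum(c * prod(x[i] for i in subset) for subset, c in coeffs.items())


def or_clause(x):
    # Assumed convention: +1 means true, -1 means false.
    return 1 if (x[0] == 1 or x[1] == 1) else -1


coeffs = walsh_fourier_coefficients(or_clause, 2)
print(coeffs)
print(evaluate(coeffs, (0.0, 0.0)))   # value at a relaxed, non-Boolean point
```

Because the polynomial agrees with the clause at every Boolean point and is differentiable in between, gradient methods can be applied to it, which is the property the abstract relies on.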

94. ❌ Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

Authors: Ruixing Jin, Zicheng Zhu, Ruixiang Ouyang, Sheng Xu, Bo Yue, Zhizheng Wu, Guiliang Liu Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22876v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

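Scoring rationale: The paper studies Sim-to-Real generalization in dexterous robotic manipulation, using Vision-Language-Action models as the testbed. It does not address the keyword areas (LLM principles, training methods, inference optimization, alignment, agent systems, or scientific AI applications) and instead focuses on robotics problems: robot control, sim-to-real transfer, domain randomization, rendering, and physics modeling. It has no direct connection to any of the 27 keywords.

!!! tip deepseek-chat TL;DR

Through more than 10,000 real-world trials, this paper empirically studies how four factors (multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates) affect the Sim-to-Real generalization of Vision-Language-Action models on dexterous manipulation tasks. It also releases the robotic platforms and evaluation protocol as a standardized benchmark.

Abstract translation

Learning a generalist control policy for dexterous manipulation usually depends on large-scale datasets. Because collecting real-world data is expensive, a practical alternative is to generate synthetic data in simulation. The resulting synthetic data, however, often differs markedly from real-world distributions. Many prior studies have proposed algorithms to bridge the Sim-to-Real gap, but principled research that grounds these methods in real-world manipulation tasks is still lacking, especially regarding their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study we empirically examine four main determinants of Sim-to-Real generalization: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support the study we design a comprehensive evaluation protocol that quantifies real-world performance on manipulation tasks and accounts for key variations in background, lighting, distractors, object types, and spatial features. From more than 10,000 real-world trials we derive important insights into Sim-to-Real transfer. To inform and advance future work, we release both the robotic platforms and the evaluation protocol publicly to enable independent verification, establishing a realistic and standardized benchmark for dexterous manipulation policies.

Abstract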

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

Keywords: Sim-to-Real generalization, dexterous manipulation, Vision-Language-Action models, domain randomization, photorealistic rendering, physics-realistic modeling, reinforcement learning, robotic platforms

95. ❌ Agent-Sentry: Bounding LLM Agents via Execution Provenance

Authors: Rohan Sequeira, Stavros Damianakis, Umar Iqbal, Konstantinos Psounis Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22868v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

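Scoring rationale: The paper's core topic is bounding the behavior of LLM agent systems for security. It is highly relevant to 'LLM Agents' (15 points) and also to 'Tool Use' (10 points) and 'Large Language Models' (10 points), because Agent-Sentry analyzes tool-call execution traces to build behavioral bounds and block deviating calls. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Agent-Sentry, a framework that builds behavioral bounds from an LLM agent's execution traces to prevent out-of-bounds attacks. In the evaluation it blocks over 90% of such attacks while preserving up to 98% of system utility.

Abstract translation

Agentic computing systems, which autonomously spawn new functionality from natural language instructions, are becoming increasingly common. Powerful as they are, these systems raise serious security, privacy, and safety concerns. Fundamentally, the full set of functionality these systems offer, together with their probabilistic execution flows, cannot be known in advance. Without such a characterization, it is very hard to verify whether a system has carried out the user's intended task or has instead performed unrelated actions, possibly because it was compromised. This paper proposes Agent-Sentry, a framework that addresses the problem by bounding the behavior of agentic systems. Our key insight is that agentic systems are designed for specific use cases and therefore need not expose unbounded or unspecified functionality. Once bounded, these systems are easier to scrutinize. Agent-Sentry puts this idea into practice by mining the frequently used functionality of an agentic system, along with its execution traces, to construct behavioral bounds. It then learns a policy from these traces and blocks tool calls that deviate from learned behavior or do not match user intent. The evaluation shows that Agent-Sentry helps prevent over 90% of attacks that attempt to trigger out-of-bounds execution, while preserving up to 98% of system utility.

Abstract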

Agentic computing systems, which autonomously spawn new functionalities based on natural language instructions, are becoming increasingly prevalent. While immensely capable, these systems raise serious security, privacy, and safety concerns. Fundamentally, the full set of functionalities offered by these systems, combined with their probabilistic execution flows, is not known beforehand. Given this lack of characterization, it is non-trivial to validate whether a system has successfully carried out the user’s intended task or instead executed irrelevant actions, potentially as a consequence of compromise. In this paper, we propose Agent-Sentry, a framework that attempts to bound agentic systems to address this problem. Our key insight is that agentic systems are designed for specific use cases and therefore need not expose unbounded or unspecified functionalities. Once bounded, these systems become easier to scrutinize. Agent-Sentry operationalizes this insight by uncovering frequent functionalities offered by an agentic system, along with their execution traces, to construct behavioral bounds. It then learns a policy from these traces and blocks tool calls that deviate from learned behaviors or that misalign with user intent. Our evaluation shows that Agent-Sentry helps prevent over 90% of attacks that attempt to trigger out-of-bounds executions, while preserving up to 98% of system utility.

Keywords: LLM Agents, Agentic Systems, Execution Provenance, Security Bounds, Tool Calls, Behavioral Analysis, Attack Prevention, System Utility
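
The abstract describes learning bounds from execution traces and blocking tool calls that deviate from them. A deliberately simplified sketch of that idea follows: it records which tool-to-tool transitions occur in benign traces and flags the first unseen transition in a new trace. The transition-set policy, the function names, and the tool names are all assumptions for illustration; the paper's learned policy and its intent-matching step are not reproduced here.

```python
START = "<start>"   # hypothetical marker for the beginning of a trace


def learn_bounds(benign_traces):
    """Collect every (previous_tool, tool) transition seen in benign traces."""
    allowed = set()
    for trace in benign_traces:
        prev = START
        for tool in trace:
            allowed.add((prev, tool))
            prev = tool
    return allowed


def first_violation(allowed, trace):
    """Return the index of the first out-of-bounds tool call, or None."""
    prev = START
    for index, tool in enumerate(trace):
        if (prev, tool) not in allowed:
            return index
        prev = tool
    return None


bounds = learn_bounds([
    ["search_flights", "book_flight", "send_email"],
    ["search_flights", "send_email"],
])
print(first_violation(bounds, ["search_flights", "book_flight"]))
print(first_violation(bounds, ["search_flights", "transfer_funds"]))
```

A real deployment would need a richer policy than exact transition matching, since novel but legitimate sequences would be blocked here; the sketch only shows where the blocking decision sits.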

96. ❌ Dynamical Systems Theory Behind a Hierarchical Reasoning Model

Authors: Vasiliy A. Es’kin, Mikhail E. Smorkalov Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

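Scoring rationale: The paper proposes the Contraction Mapping Model (CMM) for complex algorithmic reasoning tasks. It is related to LLMs (8 points) but puts more emphasis on small, efficient models (SLMs/On-device AI: 10 points). Its core contributions concern reasoning ability (Chain of Thought and System 2 Thinking: 10 points each) and extreme parameter efficiency (Quantization/Model Compression: 10 points). Other keywords, including MoE, scaling laws, training methods, alignment, RAG, long context, acceleration techniques, hallucination mitigation, interpretability, and multi-agent systems, are not covered (0 points).

!!! tip deepseek-chat TL;DR

To address the weakness of large language models on complex algorithmic reasoning, this paper proposes CMM, a mathematically grounded reasoning model based on contraction mappings and neural differential equations. With only 5M parameters it reaches 93.7% accuracy on Sudoku-Extreme, and when compressed to 0.26M parameters it still reaches 85.4% on Sudoku-Extreme and 82.2% on Maze, which the authors present as evidence that rigorous latent dynamics can replace brute-force parameter scaling.

Abstract translation

Current large language models rely mainly on linear sequence generation and massive parameter counts, yet they still struggle badly with complex algorithmic reasoning. Recent reasoning architectures such as the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM) have shown that compact recursive networks can handle such tasks, but their training usually lacks rigorous mathematical guarantees, which leads to instability and representational collapse. This work proposes the Contraction Mapping Model (CMM), a new architecture that recasts discrete recursive reasoning as continuous neural ordinary and stochastic differential equations. By explicitly forcing the latent phase point to converge to a stable equilibrium, and by using a hyperspherical repulsion loss to mitigate feature collapse, the model provides a mathematically grounded and highly stable reasoning engine. On the Sudoku-Extreme benchmark, a CMM with only 5M parameters achieves a state-of-the-art accuracy of 93.7%, well above the 27M-parameter HRM (55.0%) and the 5M-parameter TRM (87.4%). Notably, even when aggressively compressed to an ultra-tiny 0.26M parameters, the model retains strong predictive power, reaching 85.4% on Sudoku-Extreme and 82.2% on the Maze benchmark. These results set a new mark for extreme parameter efficiency and show that mathematically rigorous latent dynamics can effectively replace brute-force parameter scaling in artificial reasoning.

Abstract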

Current large language models (LLMs) primarily rely on linear sequence generation and massive parameter counts, yet they severely struggle with complex algorithmic reasoning. While recent reasoning architectures, such as the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), demonstrate that compact recursive networks can tackle these tasks, their training dynamics often lack rigorous mathematical guarantees, leading to instability and representational collapse. We propose the Contraction Mapping Model (CMM), a novel architecture that reformulates discrete recursive reasoning into continuous Neural Ordinary and Stochastic Differential Equations (NODEs/NSDEs). By explicitly enforcing the convergence of the latent phase point to a stable equilibrium state and mitigating feature collapse with a hyperspherical repulsion loss, the CMM provides a mathematically grounded and highly stable reasoning engine. On the Sudoku-Extreme benchmark, a 5M-parameter CMM achieves a state-of-the-art accuracy of 93.7 %, outperforming the 27M-parameter HRM (55.0 %) and 5M-parameter TRM (87.4 %). Remarkably, even when aggressively compressed to an ultra-tiny footprint of just 0.26M parameters, the CMM retains robust predictive power, achieving 85.4 % on Sudoku-Extreme and 82.2 % on the Maze benchmark. These results establish a new frontier for extreme parameter efficiency, proving that mathematically rigorous latent dynamics can effectively replace brute-force scaling in artificial reasoning.

Keywords: Contraction Mapping Model, Neural Ordinary Differential Equations, algorithmic reasoning, parameter efficiency, hierarchical reasoning, latent dynamics, model compression, mathematical guarantees
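
The stability argument in the abstract rests on a latent state converging to a stable equilibrium. The sketch below illustrates only the underlying fixed-point principle with a scalar toy update: when the update is a contraction, repeated application reaches the same equilibrium from any starting state. The update `a * tanh(z) + x` and all names are assumptions; this is not the CMM architecture or its neural ODE/SDE formulation.

```python
import math


def step(z, x, a=0.5):
    # |d/dz (a * tanh(z) + x)| <= |a| < 1, so the update is a contraction in z.
    return a * math.tanh(z) + x


def solve(x, z0, iters=100):
    """Iterate the update from the initial latent state z0."""
    z = z0
    for _ in range(iters):
        z = step(z, x)
    return z


z_a = solve(x=1.0, z0=-10.0)
z_b = solve(x=1.0, z0=10.0)
print(z_a, z_b)   # two very different starts, one shared equilibrium
```

Each iteration shrinks the distance between any two trajectories by at least the factor `a`, which is why the result does not depend on the starting state.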

97. ❌ Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Authors: Yang Li, Yule Liu, Xinlei He, Youjian Zhao, Qi Li, Ke Xu Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22869v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

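Scoring rationale: The paper proposes the Chain-of-Authorization framework, whose core idea is to have an LLM internalize authorization logic through reasoning trajectories. It is highly relevant to LLMs, supervised fine-tuning, and chain-of-thought reasoning (10 points each), touches on System 2 in-depth reasoning (8 points), and is somewhat related to alignment (5 points). Other keywords such as MoE, quantization, and scientific AI are not covered (0 points).

!!! tip deepseek-chat TL;DR

LLMs lack awareness of access boundaries, which creates security risks. The paper proposes the Chain-of-Authorization framework, which uses supervised fine-tuning on reasoning trajectories so that the LLM internalizes authorization logic, rejecting unauthorized access effectively while preserving utility in authorized scenarios.

Abstract translation

Large language models (LLMs) have become core cognitive components of modern artificial intelligence (AI) systems, combining internal knowledge with external context to carry out complex tasks. However, LLMs usually process all accessible data indiscriminately and lack an inherent awareness of knowledge ownership and access boundaries. This deficiency raises the risk of sensitive data leakage and adversarial manipulation, which can lead to unauthorized system access and serious security crises. Existing protection strategies rely on rigid, uniform defenses that cannot provide dynamic authorization. Structural isolation methods face scalability bottlenecks, while prompt guidance methods struggle to make fine-grained permission distinctions. This paper proposes the Chain-of-Authorization (CoA) framework, a secure training and reasoning paradigm that internalizes authorization logic into the core capabilities of LLMs. Unlike passive external defenses, CoA restructures the model's information flow: it embeds permission context at the input stage and requires the model, before generating its final response, to produce an explicit authorization reasoning trajectory comprising resource review, identity resolution, and decision stages. Through supervised fine-tuning on data covering a range of authorization states, CoA fuses policy enforcement with task responses, making authorization a causal prerequisite for a substantive response. Extensive evaluation shows that CoA not only keeps comparable task utility in authorized scenarios but also overcomes cognitive confusion when permissions do not match. The mechanism shows high rejection rates against a variety of unauthorized and adversarial access attempts. By using the reasoning ability of LLMs to perform dynamic authorization, with natural language understanding as a proactive security mechanism, CoA offers a new path to deploying reliable LLMs in modern AI systems.

Abstract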

Large Language Models (LLMs) have become core cognitive components in modern artificial intelligence (AI) systems, combining internal knowledge with external context to perform complex tasks. However, LLMs typically treat all accessible data indiscriminately, lacking inherent awareness of knowledge ownership and access boundaries. This deficiency heightens risks of sensitive data leakage and adversarial manipulation, potentially enabling unauthorized system access and severe security crises. Existing protection strategies rely on rigid, uniform defenses that prevent dynamic authorization. Structural isolation methods face scalability bottlenecks, while prompt guidance methods struggle with fine-grained permission distinctions. Here, we propose the Chain-of-Authorization (CoA) framework, a secure training and reasoning paradigm that internalizes authorization logic into LLMs’ core capabilities. Unlike passive external defenses, CoA restructures the model’s information flow: it embeds permission context at input and requires generating an explicit authorization reasoning trajectory that includes resource review, identity resolution, and decision-making stages before the final response. Through supervised fine-tuning on data covering various authorization statuses, CoA integrates policy execution with task responses, making authorization a causal prerequisite for substantive responses. Extensive evaluations show that CoA not only maintains comparable utility in authorized scenarios but also overcomes the cognitive confusion that arises when permissions mismatch. It exhibits high rejection rates against various unauthorized and adversarial access. This mechanism leverages LLMs’ reasoning capability to perform dynamic authorization, using natural language understanding as a proactive security mechanism for deploying reliable LLMs in modern AI systems.

Keywords: Large Language Models, Authorization, Chain-of-Authorization, Reasoning Trajectories, Supervised Fine-tuning, Security, Access Control, Cognitive Components
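
The abstract outlines a three-stage authorization trajectory (resource review, identity resolution, decision) that precedes the final response. As a rough sketch of what one supervised fine-tuning sample in that spirit might look like, the helper below assembles a prompt with permission context and a target that places the trajectory before the answer. The field names, the prompt layout, the refusal text, and the role-equality rule are all hypothetical and are not taken from the paper.

```python
def build_coa_example(user_role, resource, required_role, answer):
    """Build one hypothetical training sample with an authorization trajectory."""
    authorized = user_role == required_role   # assumed, deliberately naive rule
    trajectory = [
        f"Resource review: '{resource}' requires role '{required_role}'.",
        f"Identity resolution: requester holds role '{user_role}'.",
        f"Decision: {'grant' if authorized else 'deny'}.",
    ]
    response = answer if authorized else "Access denied: insufficient permission."
    prompt = f"[permission: role={user_role}] Request: read {resource}"
    # The response comes last, so it is conditioned on the decision stage.
    return {"prompt": prompt, "target": "\n".join(trajectory + [response])}


denied = build_coa_example("intern", "payroll.csv", "hr_admin", "Q3 payroll summary")
granted = build_coa_example("hr_admin", "payroll.csv", "hr_admin", "Q3 payroll summary")
print(denied["target"])
print(granted["target"])
```

Placing the decision text before the answer in the target is one way to make authorization a prerequisite for the substantive response, as the abstract describes.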

98. ❌ Agent Audit: A Security Analysis System for LLM Agent Applications

Authors: Haiyue Zhang, Yi Nian, Yue Zhao Journal/source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22853v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

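Scoring rationale: The paper "Agent Audit: A Security Analysis System for LLM Agent Applications" develops a security analysis system for LLM agent applications. It is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each) because it directly studies the security of LLM agent applications, and highly relevant to 'Tool Use' (10 points) because it analyzes security risks in tool functions. It is unrelated to the remaining keywords (0 points), since it does not cover model architecture, training methods, inference optimization, or scientific applications.

!!! tip deepseek-chat TL;DR

The paper presents Agent Audit, a system for analyzing and detecting security vulnerabilities in LLM agent applications. On a benchmark of 22 samples with 42 annotated vulnerabilities it detects 40, with 6 false positives, substantially improving recall over common SAST baselines while keeping scan times under one second.

Abstract translation

What should a developer inspect before deploying an LLM agent: the model, the tool code, the deployment configuration, or all three? In practice, many security failures in agent systems arise not from model weights alone but from the surrounding software stack: tool functions that pass untrusted input to dangerous operations, credentials exposed in deployment artifacts, and over-privileged Model Context Protocol (MCP) configurations.
This paper presents Agent Audit, a security analysis system for LLM agent applications. It analyzes Python agent code and deployment artifacts through an agent-aware pipeline that combines dataflow analysis, credential detection, structured configuration parsing, and privilege-risk checks. The system reports findings in terminal, JSON, and SARIF formats, so it integrates directly into local development workflows and CI/CD pipelines. On a benchmark of 22 samples with 42 annotated vulnerabilities, Agent Audit detects 40 vulnerabilities with only 6 false positives, substantially improving recall over common static application security testing (SAST) baselines while keeping scan times under one second. The system is open source and installable via pip, which makes security auditing of agent systems more accessible.
In the live demonstration, attendees can scan vulnerable agent repositories and watch Agent Audit identify security risks in tool functions, prompts, and elsewhere. All findings are linked to source locations and configuration paths and can be exported to VS Code and GitHub Code Scanning for interactive review.

Abstract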

What should a developer inspect before deploying an LLM agent: the model, the tool code, the deployment configuration, or all three? In practice, many security failures in agent systems arise not from model weights alone, but from the surrounding software stack: tool functions that pass untrusted inputs to dangerous operations, exposed credentials in deployment artifacts, and over-privileged Model Context Protocol (MCP) configurations. We present Agent Audit, a security analysis system for LLM agent applications. Agent Audit analyzes Python agent code and deployment artifacts through an agent-aware pipeline that combines dataflow analysis, credential detection, structured configuration parsing, and privilege-risk checks. The system reports findings in terminal, JSON, and SARIF formats, enabling direct integration with local development workflows and CI/CD pipelines. On a benchmark of 22 samples with 42 annotated vulnerabilities, Agent Audit detects 40 vulnerabilities with 6 false positives, substantially improving recall over common SAST baselines while maintaining sub-second scan times. Agent Audit is open source and installable via pip, making security auditing accessible for agent systems. In the live demonstration, attendees scan vulnerable agent repositories and observe how Agent Audit identifies security risks in tool functions, prompts, and more. Findings are linked to source locations and configuration paths, and can be exported into VS Code and GitHub Code Scanning for interactive inspection.

Keywords: LLM agent, security analysis, tool functions, vulnerability detection, deployment artifacts, SAST, Model Context Protocol, CI/CD integration

99. ❌ The Coordinate System Problem in Persistent Structural Memory for Neural Architectures

Author: Abhinaba Basu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22858v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

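Scoring rationale: The paper studies persistent structural memory in neural networks and proposes the Dual-View Pheromone Pathway Network (DPPN) architecture, which involves sparse attention routing and memory stability. It is unrelated to most keywords. Only "Mixture of Experts OR MoE OR Sparse Models" is partially relevant (5 points), because the paper discusses sparse attention and a routing mechanism, although it does not explicitly address MoE. Other keywords, such as LLMs and AI for Science, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies the coordinate system problem in persistent structural memory for neural networks. It finds that memory requires a stable coordinate system and proposes the DPPN architecture, which obtains stable coordinates from fixed random Fourier features and outperforms a transformer baseline on within-task learning.

Abstract translation

We propose the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions, and use it to identify two independent requirements for persistent structural memory in neural networks. Across five progressively refined experiments covering 5 model variants and 4 transfer targets, with up to 10 random seeds per condition, we find a core principle: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. We characterize three obstacles (pheromone saturation, surface-structure entanglement, and coordinate incompatibility) and show that when embeddings are learned from scratch, none of contrastive updates, multi-source distillation, Hungarian alignment, or semantic decomposition resolves the instability. Fixed random Fourier features provide extrinsic coordinates that are stable, structure-blind, and informative, but coordinate stability alone is not sufficient: routing-bias pheromone does not transfer (10 seeds, p>0.05). For within-task learning, DPPN outperforms the transformer and random sparse baselines (AULC 0.700 vs 0.680 vs 0.670). Replacing the routing bias with learning-rate modulation eliminates negative transfer: warm pheromone used as a learning-rate prior gains +0.003 on same-family tasks (17 seeds, p<0.05) and never reduces performance. A structure completion function over the extrinsic coordinates yields a +0.006 same-family gain beyond regularization, which shows that the catch-22 between stability and informativeness is partially permeable to learned functions. The contribution is two independent requirements for persistent structural memory: (a) coordinate stability and (b) a graceful transfer mechanism.

Abstract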

We introduce the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions, and use it to discover two independent requirements for persistent structural memory in neural networks. Through five progressively refined experiments using up to 10 seeds per condition across 5 model variants and 4 transfer targets, we identify a core principle: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. We characterize three obstacles – pheromone saturation, surface-structure entanglement, and coordinate incompatibility – and show that neither contrastive updates, multi-source distillation, Hungarian alignment, nor semantic decomposition resolves the instability when embeddings are learned from scratch. Fixed random Fourier features provide extrinsic coordinates that are stable, structure-blind, and informative, but coordinate stability alone is insufficient: routing-bias pheromone does not transfer (10 seeds, p>0.05). DPPN outperforms transformer and random sparse baselines for within-task learning (AULC 0.700 vs 0.680 vs 0.670). Replacing routing bias with learning-rate modulation eliminates negative transfer: warm pheromone as a learning-rate prior achieves +0.003 on same-family tasks (17 seeds, p<0.05) while never reducing performance. A structure completion function over extrinsic coordinates produces +0.006 same-family bonus beyond regularization, showing the catch-22 between stability and informativeness is partially permeable to learned functions. The contribution is two independent requirements for persistent structural memory: (a) coordinate stability and (b) graceful transfer mechanism.
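
The "fixed random Fourier features" mentioned in the abstract can be illustrated with the standard random Fourier feature construction, which approximates an RBF kernel. The plain-Python sketch below shows the general technique, not DPPN's code: the function name `make_rff`, the `bandwidth` parameter, and the feature count are illustrative assumptions. The projection is drawn once from a seeded generator and never trained, so the resulting coordinates are identical from run to run.

```python
import math
import random


def make_rff(input_dim, num_features, bandwidth=1.0, seed=0):
    """Return a fixed feature map z(x) with z(x).z(y) ~ exp(-|x-y|^2 / (2*bandwidth^2))."""
    rng = random.Random(seed)  # seeded once; the map is never updated afterwards
    weights = [
        [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
        for _ in range(num_features)
    ]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, phases)
        ]

    return features


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = [0.3, -0.2], [0.1, 0.4]
phi = make_rff(input_dim=2, num_features=4000, seed=42)
approx = dot(phi(x), phi(y))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)  # bandwidth = 1
print(f"RFF estimate {approx:.3f} vs exact RBF kernel {exact:.3f}")
```

Because the weights and phases depend only on the seed, two models built with the same seed share one coordinate system, which is the stability property the abstract contrasts with jointly learned embeddings.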

Keywords: persistent structural memory, coordinate system, sparse attention, pheromone field, neural architecture, transfer learning, DPPN, stable embeddings

100. ❌ UniQueR: Unified Query-based Feedforward 3D Reconstruction

Authors: Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka, Wei Zhan Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22851v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

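Scoring rationale: UniQueR addresses 3D reconstruction in computer vision. It proposes a query-based feedforward framework that uses 3D anchor queries and a 3D Gaussian representation for efficient reconstruction. The scoring keywords all concern large language models, the technical principles of deep learning, or applications of AI in science, whereas this paper studies 3D reconstruction, an entirely different area. It involves no large language models, no innovation in those technical principles, and no application of AI in science (such as bioinformatics), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes UniQueR, a framework that uses sparse 3D query inference for efficient and accurate 3D reconstruction from unposed images, and it surpasses existing feedforward methods in rendering quality and geometric accuracy.

Abstract translation

This paper presents UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D representations limited to visible surfaces. UniQueR instead formulates reconstruction as a sparse 3D query inference problem. The model learns a compact set of 3D anchor points that act as explicit geometric queries, so the network can infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space rather than in per-frame camera space, and generates a set of 3D Gaussians for differentiable rendering. By using unified query interactions across multi-view features together with a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF show that UniQueR surpasses existing feedforward methods in both rendering quality and geometric accuracy while using an order of magnitude fewer primitives than dense reconstruction methods.

Abstract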

We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.

Keywords: 3D reconstruction, query-based framework, feedforward model, 3D Gaussians, unposed images, geometric accuracy, multi-view features, differentiable rendering

101. ❌ CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

Authors: Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

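Scoring rationale: The paper studies the Embodied Visual Tracking (EVT) task using a competitive multi-agent reinforcement learning framework; its core is training vision-language-action (VLA) models in a dynamic adversarial environment. It is unrelated to the vast majority of keywords, which cover large-model technical principles, training methods, inference optimization, alignment techniques, and so on. The only relevant keyword is "Multi-agent Systems OR Agent Coordination": the paper explicitly proposes a competitive multi-agent reinforcement learning framework involving coordination and adversarial interaction between the tracker and opponent agents, so it receives 10 points (highly relevant). The other keywords are not mentioned in the title or abstract and have no related technical content, so they all receive 0 points.

!!! tip deepseek-chat TL;DR

Single-agent imitation learning methods for embodied visual tracking depend on costly expert data and generalize poorly. The paper proposes CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial environment, producing stronger adaptive planning and interference-resilient strategies and reaching state-of-the-art performance on standard benchmarks and on the newly proposed CoMaTrack-Bench.

Abstract translation

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a target specified in language. Most existing methods, however, rely on single-agent imitation learning, which needs costly expert data and generalizes poorly because the training environments are static. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework. It trains agents on competitive subtasks in a dynamic adversarial environment, which yields stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT. It features game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interaction. Experiments show that CoMaTrack achieves state-of-the-art performance on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B-parameter vision-language model (VLM) trained with this framework surpasses previous single-agent imitation learning methods based on 7B-parameter models on the challenging EVT-Bench, reaching 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be released at https://github.com/wlqcode/CoMaTrack-Bench

Abstract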

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench

Keywords: Embodied Visual Tracking, Multi-agent Reinforcement Learning, Game-theoretic Framework, Vision-Language-Action Models, Competitive Adversarial Setting, Adaptive Planning, Benchmark Evaluation

102. ❌ UAV-DETR: DETR for Anti-Drone Target Detection

Authors: Jun Yang, Dong Wang, Hongxu Yin, Hongpeng Li, Jianxiong Yu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

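Scoring rationale: UAV-DETR addresses object detection in computer vision, specifically small-object detection for counter-UAV applications. It proposes improvements to the DETR architecture, including a WTConv-enhanced backbone, a sliding-window self-attention encoder, an efficient cross-scale feature recalibration and fusion network, and a hybrid loss strategy. The scoring keywords all relate directly to large language models (LLMs), the technical principles of large models, or AI for Science applications, whereas this paper studies a conventional deep learning object detector (a DETR variant) and does not involve LLMs, large-model training or alignment techniques, AI for Science (such as bioinformatics), or related techniques. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of balancing feature representation and computational efficiency when detecting miniature drones against complex backgrounds. It proposes the UAV-DETR framework, which combines architectural improvements with a hybrid loss strategy and achieves higher detection accuracy with a lower parameter count on a custom dataset and a public benchmark.

Abstract translation

Drone detection is critical in many security and counter-UAV applications. Existing deep learning-based methods, however, usually struggle to balance robust feature representation against computational efficiency. The challenge is especially acute when detecting miniature drones against complex backgrounds and under severe environmental interference. To address these issues, we propose UAV-DETR, a new framework that combines a small-target-friendly architecture with real-time detection capability. Specifically, UAV-DETR uses a WTConv-enhanced backbone and a Sliding Window Self-Attention (SWSA-IFI) encoder to capture the high-frequency structural details of tiny targets while sharply reducing parameter overhead. We also propose an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi-scale semantic information. To further improve accuracy, UAV-DETR adopts a hybrid Inner-CIoU and NWD loss strategy, which mitigates the extreme sensitivity of standard IoU metrics to small positional deviations of small objects. Extensive experiments show that UAV-DETR significantly outperforms the RT-DETR baseline on our custom UAV dataset (+6.61% mAP50:95 with 39.8% fewer parameters) and on the public DUT-ANTI-UAV benchmark (+1.4% precision, +1.0% F1 score). These results establish UAV-DETR as a superior trade-off between efficiency and precision in counter-UAV object detection. The code is available at https://github.com/wd-sir/UAVDETR.

Abstract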

Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection capabilities. Specifically, UAV-DETR features a WTConv-enhanced backbone and a Sliding Window Self-Attention (SWSA-IFI) encoder, capturing the high-frequency structural details of tiny targets while drastically reducing parameter overhead. Furthermore, we propose an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi-scale semantics. To further enhance accuracy, UAV-DETR incorporates a hybrid Inner-CIoU and NWD loss strategy, mitigating the extreme sensitivity of standard IoU metrics to minor positional deviations in small objects. Extensive experiments demonstrate that UAV-DETR significantly outperforms the baseline RT-DETR on our custom UAV dataset (+6.61% in mAP50:95, with a 39.8% reduction in parameters) and the public DUT-ANTI-UAV benchmark (+1.4% in Precision, +1.0% in F1-Score). These results establish UAV-DETR as a superior trade-off between efficiency and precision in counter-UAV object detection. The code is available at https://github.com/wd-sir/UAVDETR.
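
The abstract's remark about IoU being extremely sensitive to small positional deviations can be checked numerically. The sketch below compares plain IoU with the Normalized Wasserstein Distance (NWD) as it is commonly defined for tiny-object detection, where each box is modeled as a 2D Gaussian. It is an illustration under assumptions, not UAV-DETR's loss: the abstract does not give the exact hybrid Inner-CIoU and NWD formulation, and the normalizing `constant` here is an assumed, dataset-dependent value.

```python
import math


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def nwd(box_a, box_b, constant=12.8):
    """Normalized Wasserstein Distance between boxes modeled as Gaussians.

    `constant` is an assumed normalizer; in practice it is tuned per dataset.
    """
    dist = math.sqrt(
        (box_a[0] - box_b[0]) ** 2
        + (box_a[1] - box_b[1]) ** 2
        + ((box_a[2] - box_b[2]) / 2) ** 2
        + ((box_a[3] - box_b[3]) / 2) ** 2
    )
    return math.exp(-dist / constant)


# The same 2-pixel horizontal shift applied to a tiny box and a large box.
tiny, tiny_shifted = (10, 10, 4, 4), (12, 10, 4, 4)
large, large_shifted = (100, 100, 40, 40), (102, 100, 40, 40)

print(f"tiny : IoU={iou(tiny, tiny_shifted):.3f}  NWD={nwd(tiny, tiny_shifted):.3f}")
print(f"large: IoU={iou(large, large_shifted):.3f}  NWD={nwd(large, large_shifted):.3f}")
```

For the 4×4 box the shift cuts IoU to one third, while the 40×40 box keeps an IoU above 0.9; the NWD value is the same in both cases because it depends on the center offset and size difference rather than on overlap area.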

Keywords: UAV-DETR, drone detection, small-target detection, DETR architecture, real-time detection, parameter efficiency, cross-scale feature fusion, hybrid loss strategy

103. ❌ URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection

Authors: Wei Luo, Peng Xing, Yunkang Cao, Haiming Yao, Weiming Shen, Zechao Li Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

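Scoring rationale: The paper addresses unsupervised anomaly detection in computer vision. It proposes a new method, URA-Net, based on convolutional neural networks and Bayesian neural networks, for industrial defect inspection and medical image analysis. The content is unrelated to the vast majority of keywords (large models, training techniques, inference optimization, alignment, agents, and so on), since those keywords refer specifically to natural language processing or general large-model techniques. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper involves medical image analysis (the OCT-2017 dataset), which is an application of AI in science (medicine), but this is not its core focus (it mainly targets industrial inspection), so it receives 5 points (partially relevant).

!!! tip deepseek-chat TL;DR

The paper proposes URA-Net, an uncertainty-integrated anomaly perception and restoration attention network that tackles over-generalization in unsupervised anomaly detection by explicitly restoring abnormal patterns to normal ones, and validates it on industrial and medical image datasets.

Abstract translation

Unsupervised anomaly detection plays a key role in industrial defect inspection and medical image analysis, and most methods rely on a reconstruction framework. These methods, however, can over-generalize and reconstruct anomalous regions well, which leads to poor detection performance. To address this, we propose the Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which goes beyond reconstructing normal patterns and explicitly restores abnormal patterns to their corresponding normal state. First, unlike traditional image reconstruction methods, we use a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To help URA-Net learn to restore anomalies, we introduce a feature-level artificial anomaly synthesis module that generates anomalous samples for training. Next, we propose an uncertainty-integrated anomaly perception module based on Bayesian neural networks to learn the distributions of anomalous and normal features. This module helps estimate anomalous regions and ambiguous boundaries, laying the foundation for the subsequent anomaly restoration. We then design a restoration attention mechanism that uses global normal semantic information to restore the detected anomalous regions, producing defect-free restored features. Finally, we perform anomaly detection and localization using residual maps between the input features and the restored features. Comprehensive experiments on two industrial datasets, MVTec AD and BTAD, and on the medical image dataset OCT-2017 confirm the effectiveness and superiority of the proposed method.

Abstract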

Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.
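
The final step in the abstract, scoring anomalies from residual maps between input features and restored features, can be sketched as follows. This is a simplified illustration, not URA-Net's code: the abstract does not state which distance is used, so the per-location L2 distance and the max-based image score below are assumptions, as are the function names.

```python
import math


def residual_anomaly_map(input_feats, restored_feats):
    """Per-location L2 distance between two H x W x C feature grids (nested lists)."""
    return [
        [math.dist(f_in, f_out) for f_in, f_out in zip(row_in, row_out)]
        for row_in, row_out in zip(input_feats, restored_feats)
    ]


def image_score(anomaly_map):
    """Image-level score: the largest residual anywhere in the map (an assumed choice)."""
    return max(max(row) for row in anomaly_map)


# Toy 2 x 2 grid with 2 channels; only the bottom-right location was "restored".
input_feats = [[[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [4.0, 6.0]]]
restored_feats = [[[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 2.0]]]

anomaly_map = residual_anomaly_map(input_feats, restored_feats)
print(anomaly_map, image_score(anomaly_map))
```

Locations where restoration changed nothing get a residual of zero, so large values localize the regions the network treated as anomalous.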

Keywords: Unsupervised anomaly detection, Anomaly restoration, Bayesian neural networks, Attention mechanism, Industrial defect inspection, Medical image analysis, Feature reconstruction, Uncertainty estimation

104. ❌ Improving Safety Alignment via Balanced Direct Preference Optimization

Authors: Shiji Zhao, Mengyang Wang, Shukun Xiong, Fangzhou Chen, Qihui Zhu, Shouwei Ruan, Yisong Xiao, Ranjie Duan, Xun Chen, XingXing Wei Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

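Scoring rationale: The paper's core topic is the safety alignment of LLMs, and it directly improves on RLHF/DPO methods, so it is highly relevant to "Large Language Models", "Instruction Tuning/Alignment", and "RLHF/DPO" (10 points each). The work is safety alignment at the post-training stage, so it is partially relevant to "Post-training/SFT" (5 points). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science applications, are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

To address overfitting in the safety alignment of large language models, the paper proposes Balanced Direct Preference Optimization (B-DPO), a mutual-information-based method that improves safety while preserving the model's general capabilities.

Abstract translation

With the rapid development and wide deployment of large language models (LLMs), their potential safety risks have drawn broad attention. Reinforcement Learning from Human Feedback (RLHF) is commonly used to improve the safety of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. Safety alignment nevertheless still suffers from severe overfitting, which limits its practical performance. This paper revisits the overfitting phenomenon from the perspective of how well the model comprehends the training data. We find that an Imbalanced Preference Comprehension phenomenon exists between the responses in a preference pair, and that it harms the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which uses mutual information to adaptively modulate the optimization strength between preferred and dispreferred responses. A series of experiments shows that, compared with state-of-the-art methods, B-DPO effectively improves safety while keeping the general capabilities of LLMs competitive on various mainstream benchmarks. Warning: This paper contains examples of harmful text, and reader discretion is recommended.

Abstract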

With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model’s comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model’s safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. Warning: This paper contains examples of harmful texts, and reader discretion is recommended.
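
For context on what B-DPO modifies, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It shows only the well-known baseline objective: the abstract does not give B-DPO's mutual-information weighting, so none is attempted here, and the function and argument names are illustrative assumptions.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# When the policy equals the reference model, the margin is zero and the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

In this baseline the chosen and rejected terms enter the margin with equal weight `beta`; according to the abstract, B-DPO instead adapts the optimization strength between the two responses.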

Keywords: Large Language Models, Safety Alignment, Direct Preference Optimization, Overfitting, Balanced DPO, Mutual Information, Reinforcement Learning from Human Feedback

105. ❌ Empirical Comparison of Agent Communication Protocols for Task Orchestration

Author: Ivan Dobrovolskyi Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

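Scoring rationale: The paper's core topic is a comparison of communication protocols in multi-agent systems, covering a tool integration protocol and an inter-agent delegation protocol, so it is highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow", "Tool Use/Function Calling/API Tool Use", and "Multi-agent Systems/Agent Coordination" (10 points each). The paper discusses AI agent systems, which implies the use of large models, so "Large Language Models/LLMs/Foundation Models" is partially relevant (5 points). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper presents the first systematic comparison of tool integration protocols, multi-agent delegation protocols, and hybrid architectures for task orchestration, quantifying the trade-offs in response time, context window consumption, cost, and error recovery.

Abstract translation

Context: AI agent systems are evolving from single-tool interactions toward complex multi-agent orchestration architectures. Two main communication protocols have emerged in the process: a tool integration protocol, which standardizes how agents invoke external tools, and an inter-agent delegation protocol, which lets autonomous agents discover one another and delegate tasks. Although dozens of enterprise partners have adopted these protocols, the literature contains no empirical comparison of them. Objective: This work aims to build the first systematic benchmark that compares tool-integration-only, multi-agent delegation, and hybrid architectures on standardized queries at three complexity levels, and to quantify their trade-offs along five dimensions: response time, context window consumption, monetary cost, error recovery, and implementation complexity.

Abstract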

Context. Nowadays, artificial intelligence agent systems are transforming from single-tool interactions to complex multi-agent orchestrations. As a result, two competing communication protocols have emerged: a tool integration protocol that standardizes how agents invoke external tools, and an inter-agent delegation protocol that enables autonomous agents to discover and delegate tasks to one another. Despite widespread industry adoption by dozens of enterprise partners, no empirical comparison of these protocols exists in the literature. Objective. The goal of this work is to develop the first systematic benchmark comparing tool-integration-only, multi-agent delegation, and hybrid architectures across standardized queries at three complexity levels, and to quantify the trade-offs in response time, context window consumption, monetary cost, error recovery, and implementation complexity.

Keywords: multi-agent systems, agent communication protocols, task orchestration, tool integration, agent delegation, benchmark comparison, AI agents, autonomous agents

106. ❌ TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

Authors: Chunxia Qin, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin, Bing Yin, Cong Liu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22819v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

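Score rationale: The paper focuses on table recognition, a specific computer vision and document analysis task, and proposes TDATR, an end-to-end method that improves performance through table detail-aware learning and cell-level visual alignment. Its core techniques involve vision-language alignment, multi-task learning, and HTML generation, but it does not address large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, or RLHF), applications of large models to other domains (such as AI for Science), or any of the specific large-model techniques listed in the scoring keywords. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the poor performance of existing table recognition methods in data-constrained settings by proposing TDATR, an end-to-end table recognition model that uses table detail-aware learning and cell-level visual alignment to reach state-of-the-art or highly competitive performance on multiple benchmarks.

Abstract (translated)

Tables are ubiquitous across documents, which makes table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines model table structure and content separately, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and perform poorly in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy: the model first performs table detail-aware learning, jointly perceiving table structure and content through multiple structure-understanding and content-recognition tasks designed under a language modeling paradigm. These tasks can naturally draw on document data from diverse scenarios to improve robustness. The model then integrates the implicit table details to generate structured HTML output, enabling more efficient TR modeling when training data is limited. In addition, we design a structure-guided cell localization module and integrate it into the end-to-end TR framework; it locates cells efficiently and strengthens vision-language alignment, improving both the interpretability and the accuracy of TR. Without dataset-specific fine-tuning, we achieve state-of-the-art or highly competitive performance on seven benchmarks.

Abstract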

Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

Keywords: Table Recognition, End-to-End, Table Detail-Aware Learning, Cell-Level Visual Alignment, HTML Generation, Document Analysis, Vision-Language Alignment, Multi-task Learning

107. ❌ When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

Authors: Abhinaba Basu, Pavan Chakraborty Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

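Score rationale: The paper studies whether the reasoning that language models display is genuinely used. It is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (15 points) because it directly evaluates whether CoT is actually used; relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (10 points) through its evaluation of in-depth reasoning; relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points) because it explains model behavior through attention-pattern analysis; relevant to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8 points) through its verification of reasoning faithfulness; relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because it tests multiple frontier models; relevant to 'Small Language Models OR SLMs OR On-device AI' (5 points) because it compares against smaller models; relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because it includes a medical QA task; and relevant to 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points) because it touches on model self-verification. The remaining keywords are unrelated to the paper or not mentioned and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that the step-by-step reasoning of frontier language models is mostly decorative rather than genuinely used: an evaluation that removes reasoning sentences one at a time shows that model answers usually do not depend on the reasoning displayed.

Abstract (translated)

Language models increasingly "show their work" by reasoning step by step before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already settled on an answer? Consider a medical AI that writes: "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the observation about eosinophilia, does the diagnosis change? For most frontier models the answer is no: the step is merely decorative.
We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access (no model weights) and costs roughly $1-2 per model per task.
Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) on sentiment analysis, mathematical reasoning, topic classification, and medical QA (N=376-500 per task), we find that most models produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone is enough to recover the original answer. This holds even on math, where smaller models (0.8-8B) show 55% step necessity.
Two models break the pattern: MiniMax-M2.5 reaches 37% step necessity on sentiment analysis and Kimi-K2.5 reaches 39% on topic classification, but both still take shortcuts on other tasks. Faithfulness of reasoning varies by model and by task.
We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that chain-of-thought (CoT) attention drops more in late layers on decorative tasks (33%) than on faithful ones (20%).
Implications: step-by-step explanations from frontier models are largely decorative, per-model and per-domain evaluation is essential, and training objectives, not model scale, determine whether reasoning is genuine.

Abstract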

Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes “The patient’s eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B.” If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access – no model weights – and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover “output rigidity”: on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
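The step-level evaluation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence splitter is a naive regex, and `answer_fn` is a hypothetical callable standing in for an API call that returns the model's answer for a question plus a (possibly ablated) reasoning trace.

```python
import re
from typing import Callable, List


def split_steps(reasoning: str) -> List[str]:
    """Split a reasoning trace into sentences with a naive punctuation rule."""
    parts = re.split(r"(?<=[.!?])\s+", reasoning.strip())
    return [p for p in parts if p]


def step_necessity(
    question: str,
    reasoning: str,
    answer_fn: Callable[[str, str], str],  # hypothetical: (question, reasoning) -> answer
) -> float:
    """Fraction of steps whose removal changes the answer (0.0 means fully decorative)."""
    steps = split_steps(reasoning)
    if not steps:
        return 0.0
    baseline = answer_fn(question, " ".join(steps))
    changed = 0
    for i in range(len(steps)):
        # Ablate exactly one sentence and ask for the answer again.
        ablated = " ".join(steps[:i] + steps[i + 1:])
        if answer_fn(question, ablated) != baseline:
            changed += 1
    return changed / len(steps)


def toy_model(question: str, reasoning: str) -> str:
    """Toy stand-in whose answer depends on a single step of the reasoning."""
    return "B" if "eosinophilia" in reasoning else "A"


trace = "The patient has eosinophilia. Livedo reticularis is present. This followed catheterization."
necessity = step_necessity("Which diagnosis?", trace, toy_model)
```

With a real model, `answer_fn` would wrap the API call; a necessity near zero corresponds to what the abstract calls decorative reasoning.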

Keywords: language models, step-by-step reasoning, decorative reasoning, faithfulness evaluation, attention patterns, medical QA, model evaluation, reasoning verification

108. ❌ Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts

Authors: Xianwei Cao, Dou Quan, Zhenliang Zhang, Shuang Wang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22813v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

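Score rationale: The paper studies Dynamic Preference Inference (DPI) for sequential decision-making, which belongs to reinforcement learning (RL) and multi-objective optimization; it does not touch on large language models (LLMs), deep learning fundamentals, AI for Science, or the other keywords. All keywords relate directly to large-model techniques, training methods, inference optimization, or AI applications, whereas this paper concentrates on preference modeling within a conventional RL framework and contains no large-model or deep learning innovation, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to infer unobserved preference weights dynamically as context changes, proposes the Dynamic Preference Inference (DPI) framework, and reports higher performance than fixed-weight baselines in multi-objective environments.

Abstract (translated)

Humans often juggle several, sometimes conflicting, objectives and adjust their priorities as circumstances change, rather than following a fixed objective function. By contrast, most computational decision-making and multi-objective reinforcement learning methods assume static preference weights or a known scalar reward. In this work we study how to solve sequential decision-making problems when the preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which the agent maintains a probabilistic belief over preference weights, updates that belief from recent interaction, and conditions its policy on the inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about the latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.

Abstract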

Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study the sequential decision-making problem when these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.

Keywords: Dynamic Preference Inference, sequential decision-making, multi-objective RL, preference weights, contextual shifts, actor-critic, vector-valued returns

109. ❌ PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Authors: Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22796v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

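Score rationale: The paper proposes PhotoAgent, an embodied agent for photography that combines large multimodal model (LMM) reasoning with a novel control paradigm. The most relevant keywords are: 1) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points): the abstract explicitly states that LMM-driven chain-of-thought reasoning translates aesthetic goals into geometric constraints; 2) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): PhotoAgent is itself an autonomous agent that implements an end-to-end workflow from language commands to geometric control; 3) 'World Models AND General World Models' (10 points): it builds a photorealistic internal world model based on 3D Gaussian Splatting (3DGS) for visual reflection and iterative refinement; 4) 'Self-Correction OR Self-Improvement OR Self-Reflection' (8 points): the initial viewpoint is refined iteratively through visual reflection inside the internal world model; 5) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points): the work involves in-depth spatial reasoning and aesthetic analysis; 6) 'Large Language Models OR LLMs OR Foundation Models' (8 points): it reasons with LMMs, which extend LLMs; 7) 'Tool Use OR Function Calling OR API Tool Use' (5 points): the agent's control of the camera can be viewed as tool use. Other keywords such as MoE, quantization, and RAG do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study addresses the semantic gap that embodied agents face in photography when turning high-level language commands into geometric control. By combining chain-of-thought reasoning from a large multimodal model with an internal world model built with 3D Gaussian Splatting, it automatically refines subjective aesthetic goals into high-quality camera viewpoints.

Abstract (translated)

Embodied agents for creative tasks such as photography must bridge the semantic gap between high-level language commands and geometric control. We present PhotoAgent, an agent that achieves this by combining large multimodal model (LMM) reasoning with a novel control paradigm. PhotoAgent first uses LMM-driven chain-of-thought reasoning to translate subjective aesthetic goals into solvable geometric constraints, so that an analytical solver can compute a high-quality initial viewpoint. This initial pose is then refined iteratively through visual reflection inside a photorealistic internal world model built with 3D Gaussian Splatting. This "mental simulation" replaces costly and slow physical trial and error and enables rapid convergence to aesthetically superior results. Evaluations show that PhotoAgent excels at spatial reasoning and achieves superior final image quality.

Abstract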

Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.

Keywords: Embodied Agents, Large Multimodal Models, Chain-of-Thought Reasoning, World Models, 3D Gaussian Splatting, Spatial Understanding, Aesthetic Understanding, Robotic Photography

110. ❌ Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Authors: Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, Miao Liu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

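Score rationale: The paper focuses on the 3D spatial reasoning ability of multimodal large language models (MLLMs). Its core contribution is TRACE, a prompting method that improves spatial question-answering accuracy by generating text-based spatial representations as intermediate reasoning traces. The highly relevant keywords are: 1) 'Large Language Models' (the paper explicitly studies MLLMs, which fall under large models; weight 1.0, relevance 10); 2) 'Chain of Thought' (TRACE essentially guides the model to generate intermediate reasoning steps, which closely matches chain of thought; weight 1.0, relevance 10); 3) 'System 2 Thinking' (the paper emphasizes structured spatial reasoning and in-depth reasoning, which relates to System 2 thinking; weight 1.0, relevance 8). Other keywords such as MoE, quantization, and RAG are not covered and score 0. The weighted total is (10×1.0)+(10×1.0)+(8×1.0)=28.0, which exceeds the dynamic passing score of 26.6.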
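The arithmetic in the rationale above can be checked with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the function names and the shortened keyword labels are hypothetical and chosen only for illustration.

```python
from typing import Dict


def weighted_total(relevance: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum relevance x weight over all keywords; a missing weight counts as 0."""
    return sum(score * weights.get(name, 0.0) for name, score in relevance.items())


def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration: base + factor x total weight."""
    return base + factor * total_weight


# Relevance values quoted in the rationale (all other keywords are 0).
relevance = {
    "Large Language Models": 10.0,
    "Chain of Thought": 10.0,
    "System 2 Thinking": 8.0,
}
unit_weights = {name: 1.0 for name in relevance}  # weights quoted in the rationale
zero_weights = {name: 0.0 for name in relevance}  # weights listed in the score table

total_with_unit_weights = weighted_total(relevance, unit_weights)
total_with_zero_weights = weighted_total(relevance, zero_weights)
threshold = passing_score(27.0)  # 27 keywords, each with configured weight 1.0
```

With the unit weights quoted in the rationale the total is 28.0, above the threshold of 26.6; with the 0.0 weights listed in the score table the same sum is 0.0, which is the score shown for this entry.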

!!! tip deepseek-chat TL;DR

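To address the weakness of multimodal large language models in 3D spatial reasoning, the paper proposes TRACE, a reasoning method guided by textual representations. By generating structured spatial representations as intermediate reasoning steps, it markedly improves accuracy on video spatial question-answering tasks.

Abstract (translated)

Existing multimodal large language models (MLLMs) struggle with 3D spatial reasoning because they have difficulty building structured abstractions of the 3D environment from video input. To close this gap, we draw on cognitive theories of allocentric spatial reasoning and investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we propose Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that guides MLLMs to generate text-based representations of the 3D environment as intermediate reasoning traces, leading to more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and fine-grained object entities to support structured spatial reasoning over egocentric video. Extensive experiments on VSI-Bench and OST-Bench show that TRACE yields notable and consistent gains over existing prompting strategies across a range of MLLM backbones with different parameter scales and training schemas. We further validate our design choices with ablation studies and analyze in detail where current MLLMs are bottlenecked in 3D spatial reasoning.

Abstract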

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.

Keywords: Multimodal Large Language Models, Spatial Reasoning, 3D Environment, Textual Representation, Intermediate Reasoning Traces, Egocentric Video, Prompting Method, TRACE

111. ❌ Off-Policy Value-Based Reinforcement Learning for Large Language Models

Authors: Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23355v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

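Score rationale: The paper's core topic is reinforcement learning for training LLMs. It proposes ReVal, a value-based off-policy RL framework, and is therefore highly relevant to 'Large Language Models' and 'RLHF' (10 points). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the low sample efficiency of reinforcement learning for large language models, the paper proposes ReVal, a value-based off-policy method that converges faster and performs better on mathematical reasoning tasks.

Abstract (translated)

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks, where generating trajectories is expensive. However, the dominant RL methods for large language models (LLMs) are mostly on-policy: they update on each batch of data only once, then discard it and collect fresh samples, which results in poor sample efficiency. This paper explores an alternative value-based RL framework for LLMs that naturally supports off-policy learning. We propose ReVal, a method based on Bellman updates that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training and can therefore reuse past trajectories efficiently. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and, relative to GRPO, achieves a 2.7% improvement on AIME24 and a 4.5% improvement on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.

Abstract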

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvement of 2.7% in AIME24 and 4.5% in out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
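The off-policy idea in the abstract, reusing stored transitions for repeated Bellman updates, can be illustrated with a toy sketch. This is generic tabular Q-learning with a replay buffer on a hypothetical three-state chain; it is not ReVal, and the environment, hyperparameters, and function names are all assumptions made for illustration.

```python
import random
from collections import deque
from typing import Deque, Dict, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]


class ReplayBuffer:
    """Fixed-capacity store of past transitions that can be sampled repeatedly."""

    def __init__(self, capacity: int) -> None:
        self.data: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self.data.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def bellman_update(q: Dict[Tuple[int, int], float], t: Transition,
                   alpha: float = 0.5, gamma: float = 0.9, n_actions: int = 2) -> None:
    """One Q-learning step: move Q(s, a) toward r + gamma * max_b Q(s', b)."""
    s, a, r, s_next, done = t
    best_next = max(q.get((s_next, b), 0.0) for b in range(n_actions))
    target = r if done else r + gamma * best_next
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (target - old)


def train_from_buffer(iterations: int = 200, seed: int = 0) -> Dict[Tuple[int, int], float]:
    """Collect data once on a toy chain (states 0, 1, goal 2), then reuse it many times."""
    buffer = ReplayBuffer(capacity=16)
    for s in (0, 1):
        for a in (0, 1):  # action 0 stays in place, action 1 moves right
            s_next = s + a
            done = s_next == 2
            buffer.add((s, a, 1.0 if done else 0.0, s_next, done))
    rng = random.Random(seed)
    q: Dict[Tuple[int, int], float] = {}
    for _ in range(iterations):
        for t in buffer.sample(4, rng):  # the same stored transitions are replayed
            bellman_update(q, t)
    return q
```

In this toy setting the four transitions are collected once and replayed for every update, which is the data-reuse property that an on-policy method gives up by discarding each batch after a single update.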

Keywords: Large Language Models, Reinforcement Learning, Off-policy Learning, Value-based RL, Sample Efficiency, Bellman Update, Mathematical Reasoning, Training Efficiency

112. ❌ Steering LLMs for Culturally Localized Generation

Authors: Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, Rajiv Mathews, Lun Wang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23301v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

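Score rationale: The paper's core topic is cultural bias in LLMs. It uses sparse autoencoders for mechanistic interpretability analysis and builds Cultural Embeddings (CuE) to diagnose and intervene on cultural representations. It is therefore highly relevant to 'Large Language Models' (the main object of study) and 'Mechanistic Interpretability' (the main method) (10 points). It is relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points) because the work involves cultural alignment and steering, and partially relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points) because it mentions post-training alignment methods. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and Agents are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies cultural bias in LLMs. It uses mechanistic interpretability to identify cultural features and build Cultural Embeddings (CuE), which enable diagnosis and controllable steering of cultural representations in LLMs, improving cultural faithfulness and eliciting long-tail cultural concepts.

Abstract (translated)

Large language models are deployed worldwide, yet their responses tend to favor cultures with abundant training data. Existing cultural localization approaches, such as prompting or post-training alignment, are black-box, hard to control, and unable to tell whether a failure stems from missing knowledge or from poor elicitation. This paper addresses these gaps by using mechanistic interpretability to uncover and manipulate cultural representations in large language models. Using sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). CuE can be used both to analyze implicit cultural bias under underspecified prompts and to construct white-box steering interventions. Experiments on multiple models show that CuE-based steering increases cultural faithfulness compared with prompting alone and elicits significantly rarer, long-tail cultural concepts. Notably, this white-box steering is complementary to black-box localization techniques: applying CuE on top of prompt-augmented inputs yields additional gains. This also suggests that models do benefit from better elicitation strategies and do not necessarily lack long-tail knowledge representations, although this varies across cultures. Our results provide both diagnostic insight into cultural representations in large language models and a controllable method for steering toward a desired culture.

Abstract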

LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This also suggests that models do benefit from better elicitation strategies, and don’t necessarily lack long-tail knowledge representation, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.
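The abstract does not specify how features are aggregated or how steering is applied, so the following is only a generic activation-steering sketch under assumptions: selected sparse-autoencoder decoder directions are averaged and normalized into a single vector, which is then added to a hidden state with a strength `alpha`. The decoder rows, feature indices, and `alpha` are hypothetical values, not taken from the paper.

```python
import math
from typing import List, Sequence


def aggregate_features(decoder_rows: Sequence[Sequence[float]],
                       feature_ids: Sequence[int]) -> List[float]:
    """Average the chosen decoder directions and normalize to unit length (assumed scheme)."""
    dim = len(decoder_rows[0])
    mean = [sum(decoder_rows[i][d] for i in feature_ids) / len(feature_ids)
            for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Add alpha times the steering direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical 3-dimensional decoder with one direction per learned feature.
decoder_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
culture_direction = aggregate_features(decoder_rows, feature_ids=[0, 1])
steered = steer([0.0, 0.0, 1.0], culture_direction, alpha=2.0)
```

In a real model the same addition would be applied to residual-stream activations at a chosen layer; which layer and which strength work well is an empirical question that this sketch does not answer.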

Keywords: Large Language Models, Cultural Bias, Mechanistic Interpretability, Sparse Autoencoders, Cultural Embeddings, Steering Interventions, Cultural Localization, Interpretable Features

113. ❌ Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models

Authors: Nasser A Alsadhan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23251v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
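
The rows above read as a per-keyword product: each row's score appears to be its weight times its relevance, which would explain why a weight of 0.0 yields 0.0 even at 10.0/10 relevance. A minimal Python sketch of that reading follows. The multiplication rule and the function names are assumptions inferred from the rows, not the report generator's actual code; the pass mark of 26.6 is the figure shown in the score line.

```python
from typing import List, Tuple

PASS_MARK = 26.6  # threshold shown in the report's score line


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: a row's score is weight * relevance."""
    return weight * relevance


def total_score(rows: List[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows as in the table above: all 27 weights are 0.0, one relevance is 10.0.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26
score = total_score(rows)
passed = score >= PASS_MARK
```

Under this reading, any paper scored with all-zero weights fails the threshold regardless of its relevance values.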

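Scoring rationale: The paper's core topic is the ability of LLMs to express emotion and mimic personality, evaluated across six models (Jais, Mistral, LLaMA, GPT-4o, Gemini, DeepSeek), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the technical principles or applications behind the other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, or compression, so those keywords score 0. Although it touches on AI-generated text detection and affective computing, it does not directly target hallucination mitigation, explainable AI, or similar techniques, nor scientific applications such as bioinformatics, so the related keywords also score 0.

!!! tip deepseek-chat TL;DR

The study evaluates how text from six LLMs differs from human text in English emotional expression and Arabic personality mimicry. AI-generated text can be separated from human text by classifiers (F1>0.95), yet the models encode affective signals differently from humans, and synthetic data improves performance on the Arabic personality classification task.

Abstract (Chinese translation)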

大型语言模型流畅度的不断提升，引发了一个重要问题：它们是否能够在多样化的语言和文化语境中，有效模拟包括情感表达与人格特质在内的复杂人类特征。本研究探讨了LLMs能否令人信服地模仿英语中的情感细微差别以及阿拉伯语中的人格标记。阿拉伯语作为一种关键的低资源语言，具有独特的语言和文化特征。我们在六个模型上进行了两项任务：Jais、Mistral、LLaMA、GPT-4o、Gemini和DeepSeek。首先，我们评估了机器分类器能否可靠地区分人类撰写和AI生成的文本。其次，我们评估了LLM生成的文本在多大程度上表现出与人类相似的情感或人格特质。我们的结果表明，AI生成的文本与人类撰写的文本是可区分的（F1>0.95），但在经过改写的样本上分类性能会下降，这表明分类器依赖于表面的风格线索。情感和人格分类实验揭示了显著的泛化差距：在人类数据上训练的分类器在AI生成的文本上表现不佳，反之亦然，这表明LLMs编码情感信号的方式与人类不同。重要的是，在阿拉伯语人格分类任务中，用AI生成的数据增强训练能提升性能，这突显了合成数据在应对低资源语言挑战方面的潜力。针对特定模型的分析显示，GPT-4o和Gemini表现出更优越的情感连贯性。语言学和心理语言学分析揭示了人类文本与AI文本在语气、真实性和文本复杂性方面存在可测量的差异。这些发现对情感计算、作者归属判定和负责任的AI部署具有重要意义，尤其是在生成式AI检测和对齐面临独特挑战的低资源语言环境中。

Abstract

The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models: Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F1>0.95), though classification performance deteriorates on paraphrased samples, indicating a reliance on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence. Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. These findings have implications for affective computing, authorship attribution, and responsible AI deployment, particularly within under-resourced language contexts where generative AI detection and alignment pose unique challenges.

Keywords: Large Language Models, Emotion Expression, Personality Mimicry, AI-generated Text Detection, Arabic Language Processing, Affective Computing, Linguistic Style Analysis, Under-resourced Languages

114. ❌ I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes

Authors: Shijia Zhou, Saif M. Mohammad, Barbara Plank, Diego Frassinelli Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

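Scoring rationale: The paper evaluates eight state-of-the-art multimodal large language models (MLLMs) on identifying and explaining figurative meaning in internet memes, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It examines the models' ability to explain their predictions, which relates to 'Mechanistic Interpretability OR Explainable AI' (5 points). The evaluation checks whether the provided reasoning supports the predicted label, which touches on 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5 points each). The finding that models tend to attach figurative meaning even when none is present bears on factuality and hallucination, relating to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points). The paper does not directly address the remaining keywords, such as MoE, SLMs, scaling laws, training techniques, inference optimization, agent systems, model compression, or AI for science, so they score 0.

!!! tip deepseek-chat TL;DR

The study evaluates multimodal LLMs on identifying and explaining figurative meaning in internet memes. All models show a strong bias toward associating a meme with figurative meaning even when none is present, and correct predictions are not always accompanied by faithful explanations.

Abstract (Chinese translation)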

网络模因作为一种流行的多模态在线传播形式，常通过图文结合的方式运用比喻性元素传递多层次含义。然而，多模态大语言模型如何整合并解读视觉与文本信息以识别模因中的比喻意义，目前仍不甚明确。为填补这一研究空白，我们在三个数据集上评估了八种前沿生成式多模态大语言模型对六类比喻意义的检测与解释能力。此外，我们对这些模型生成的解释进行了人工评估，以判断其推理过程是否支持预测标签，以及是否忠实于原始模因内容。研究结果表明，所有模型均表现出强烈的倾向性——即使模因本身不含比喻意义，模型仍倾向于将其与比喻含义关联。定性分析进一步显示，正确的预测并不总是伴随着忠实的解释。

Abstract

Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.

Keywords: multimodal large language models, MLLMs, figurative meaning, memes, explanation faithfulness, model evaluation, bias detection, human evaluation

115. ❌ Decoding AI Authorship: Can LLMs Truly Mimic Human Style Across Literature and Politics?

Authors: Nasser A Alsadhan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

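Scoring rationale: The paper's core topic is the ability of LLMs (GPT-4o, Gemini 1.5 Pro, Claude Sonnet 3.5) to mimic human authorial style, an application-level evaluation of LLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the techniques, methods, or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This study evaluates how well large language models (LLMs) mimic the writing styles of prominent literary and political figures. AI-generated text still differs significantly from human writing in its stylistic features, with perplexity as the primary discriminative metric.

Abstract (Chinese translation)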

随着生成式人工智能模仿特定人类风格的能力日益增强，本研究调查了包括GPT-4o、Gemini 1.5 Pro和Claude Sonnet 3.5在内的前沿大语言模型（LLMs）模拟著名文学与政治人物——沃尔特·惠特曼、威廉·华兹华斯、唐纳德·特朗普和巴拉克·奥巴马——作者风格特征的能力。通过采用严格主题对齐的零样本提示框架，我们生成了合成文本集，并借助结合基于Transformer的分类模型（BERT）与可解释机器学习（XGBoost）的互补评估框架进行分析。我们的方法整合了语言探索与词频统计（LIWC）指标、困惑度及可读性指数，以评估AI生成文本与人类撰写文本之间的差异。结果表明，AI生成的仿写文本仍具有高度可检测性：仅使用八个风格计量特征训练的XGBoost模型，其准确率可与高维神经分类器相媲美。特征重要性分析指出困惑度是首要判别指标，揭示了AI输出在随机规律性方面与人类写作更高变异性之间存在显著差异。尽管大语言模型在低维启发式特征（如句法复杂度和可读性）上表现出与人类作者的分布趋同，但尚未完全复现人类文本中固有的细腻情感密度和风格变异。通过揭示当前生成式模仿存在的具体统计差距，本研究为大语言模型的风格行为提供了全面基准，并为数字人文与社交媒体领域的作者身份归属研究提供了关键洞见。

Abstract

Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus. By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.
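
Perplexity, which the abstract names as the primary discriminative metric, is the exponential of the average negative log-likelihood per token. The sketch below is a minimal Python illustration, assuming per-token natural-log probabilities are already available from some scoring language model. The function name and the sample values are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative values only: a text whose tokens each have probability 0.25.
uniform_text = [math.log(0.25)] * 8
ppl = perplexity(uniform_text)
```

A lower value means the scoring model finds the text more predictable, which is the kind of regularity a detector can pick up on.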

Keywords: Large Language Models, LLMs, stylistic mimicry, authorship attribution, stylometric analysis, AI-generated text, human-authored text, perplexity

116. ❌ Sparser, Faster, Lighter Transformer Language Models

Authors: Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

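Scoring rationale: The paper's core topic is sparse-model techniques for LLMs (the MoE/Sparse Models keyword scores 10): it reaches over 99% sparsity through L1 regularization and develops dedicated CUDA kernels to accelerate sparse computation (the Inference Acceleration keyword scores 10). It directly targets the efficiency of large language models (the LLMs keyword scores 10) and falls within model compression (the Quantization/Model Compression keyword scores 5). The other keywords, such as small models, alignment, reasoning methods, and scientific applications, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper uses L1 regularization to reach over 99% sparsity in the feedforward layers of LLMs and develops dedicated CUDA kernels to accelerate sparse computation, substantially improving the inference efficiency, energy efficiency, and memory usage of large language models.

Abstract (Chinese translation)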

自回归大语言模型（LLM）的规模化发展推动了前所未有的进步，但也带来了巨大的计算成本。在本研究中，我们通过利用LLM前馈层中的非结构化稀疏性来应对这些成本，这些层占据了模型参数和执行浮点运算（FLOPs）的主要部分。为此，我们引入了一种新的稀疏打包格式和一组专为现代GPU优化执行流程设计的CUDA内核，使其能够在LLM推理和训练期间实现高效的稀疏计算。为验证所获增益，我们对LLM稀疏性进行了定量研究，证明简单的L1正则化可以诱导超过99%的稀疏度，且对下游任务性能的影响可忽略不计。结合我们开发的内核，我们展示了这种稀疏度水平能够转化为显著的吞吐量提升、能效改善和内存使用优化，且这些效益随模型规模扩大而增强。我们将在开源许可下发布所有代码与内核，以促进技术采用，并加速相关研究，从而确立稀疏性作为提升现代基础模型效率与可扩展性的一个实用方向。

Abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM’s feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
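
The L1 regularization mentioned in the abstract adds a penalty proportional to the sum of absolute weight values to the training loss, which pushes many weights toward zero. The plain-Python sketch below shows that penalty together with a sparsity measurement. The coefficient `lam`, the threshold `tol`, and the toy matrix are assumptions for illustration; the paper's training setup, packing format, and CUDA kernels are not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def l1_penalty(weights: Matrix, lam: float) -> float:
    """L1 term added to the task loss: lam * sum(|w|)."""
    return lam * sum(abs(w) for row in weights for w in row)


def sparsity(weights: Matrix, tol: float = 0.0) -> float:
    """Fraction of entries whose magnitude is at most tol."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if abs(w) <= tol) / len(flat)


# Toy feedforward weight matrix: 3 nonzero entries out of 8.
w = [[0.0, 0.5, 0.0, 0.0],
     [-1.5, 0.0, 0.0, 2.0]]
penalty = l1_penalty(w, lam=0.01)  # 0.01 * (0.5 + 1.5 + 2.0)
frac_zero = sparsity(w)            # 5 zero entries out of 8
```

In practice the penalty would be added to the task loss at every training step, and the sparsity fraction tracked over time.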

Keywords: sparse models, large language models, inference acceleration, L1 regularization, CUDA kernels, model efficiency, transformer, computational cost reduction

117. ❌ From Synthetic to Native: Benchmarking Multilingual Intent Classification in Logistics Customer Service

Authors: Haoyu He, Jinyu Zhuang, Haoran Chu, Shuhang Yu, J, T AI Group, Hao Wang, Kunpeng Han Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

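Scoring rationale: The paper studies multilingual intent classification in logistics customer service, builds a benchmark dataset from real customer-service logs, and evaluates several model families, including small language models. Keyword relevance: 1) 'Small Language Models' scores 8 because the paper explicitly benchmarks small language models; 2) 'Large Language Models' and 'Scaling Laws AND Data Quality' score 5 each, the former because the paper uses LLM-assisted quality control and the latter because it concerns the effect of data quality on model performance; 3) the other keywords score 0 because the paper does not address those techniques or application areas.

!!! tip deepseek-chat TL;DR

The paper builds a multilingual intent classification benchmark from real logistics customer-service logs and finds that machine-translated test sets overestimate model performance on noisy native queries, especially for long-tail intents and cross-lingual transfer.

Abstract (Chinese translation)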

多语言意图分类是全球物流平台客服系统的核心任务，其模型需处理跨语言及层级化标签空间中的嘈杂用户查询。然而现有大多数多语言基准依赖机器翻译文本，这类文本通常比真实的客户请求更规范、更标准，可能导致对模型实际鲁棒性的高估。本文提出一个基于真实物流客服日志构建的层级化多语言意图分类公共基准。该数据集通过过滤、大语言模型辅助质量控制及人工验证，从60万条历史记录中筛选出约3万条经去标识化的独立用户查询，并组织为包含13个父类别和17个子类别的两级分类体系。英语、西班牙语和阿拉伯语作为训练可见语言，印尼语、中文及其他仅用于测试的语言支持零样本评估。为直接衡量合成数据与真实评估之间的差距，我们提供了成对的原始查询与机器翻译测试集，并在扁平化与层级化分类协议下对多语言编码器、嵌入模型和小型语言模型进行基准测试。结果表明，翻译测试集会显著高估模型在嘈杂原始查询上的性能，尤其对长尾意图和跨语言迁移任务而言，这凸显了构建更贴近现实的多语言意图基准的必要性。

Abstract

Multilingual intent classification is central to customer-service systems on global logistics platforms, where models must process noisy user queries across languages and hierarchical label spaces. Yet most existing multilingual benchmarks rely on machine-translated text, which is typically cleaner and more standardized than native customer requests and can therefore overestimate real-world robustness. We present a public benchmark for hierarchical multilingual intent classification constructed from real logistics customer-service logs. The dataset contains approximately 30K de-identified, stand-alone user queries curated from 600K historical records through filtering, LLM-assisted quality control, and human verification, and is organized into a two-level taxonomy with 13 parent and 17 leaf intents. English, Spanish, and Arabic are included as seen languages, while Indonesian, Chinese, and additional test-only languages support zero-shot evaluation. To directly measure the gap between synthetic and real evaluation, we provide paired native and machine-translated test sets and benchmark multilingual encoders, embedding models, and small language models under flat and hierarchical protocols. Results show that translated test sets substantially overestimate performance on noisy native queries, especially for long-tail intents and cross-lingual transfer, underscoring the need for more realistic multilingual intent benchmarks.
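
With a two-level taxonomy, a prediction can be scored at the leaf level and at the parent level. The sketch below shows one simple way to compute both accuracies in Python. The leaf-to-parent map, the intent names, and the choice of metric are hypothetical, since the abstract does not give the benchmark's taxonomy labels or protocol details.

```python
from typing import Dict, List, Tuple

# Hypothetical slice of a two-level taxonomy: leaf intent -> parent intent.
LEAF_TO_PARENT: Dict[str, str] = {
    "track_parcel": "tracking",
    "delivery_delay": "tracking",
    "refund_request": "payment",
}


def two_level_accuracy(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (leaf accuracy, parent accuracy) for aligned label lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    leaf_hits = sum(g == p for g, p in zip(gold, pred))
    parent_hits = sum(
        LEAF_TO_PARENT[g] == LEAF_TO_PARENT[p] for g, p in zip(gold, pred)
    )
    return leaf_hits / len(gold), parent_hits / len(gold)


gold = ["track_parcel", "delivery_delay", "refund_request", "track_parcel"]
pred = ["track_parcel", "track_parcel", "refund_request", "refund_request"]
leaf_acc, parent_acc = two_level_accuracy(gold, pred)
```

The gap between the two numbers shows how many errors stay within the correct parent category.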

Keywords: multilingual intent classification, logistics customer service, benchmark dataset, small language models, machine-translated text, real-world robustness, zero-shot evaluation, hierarchical taxonomy

118. ❌ UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

Authors: Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23160v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

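Scoring rationale: UniDial-EvalKit focuses on building a unified evaluation toolkit for multi-turn interactive AI systems such as dialogue systems. Keyword relevance: 1) it relates to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because LLMs are the core component of interactive AI systems and the toolkit applies to evaluating LLM-based systems; 2) it relates to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points) because the toolkit targets multi-turn interaction ability, which matches the evaluation needs of agentic workflows and autonomous agents; 3) the other keywords (0 points) are unrelated: the paper does not cover model architecture (e.g., MoE, SLMs), training techniques (e.g., pre-training, fine-tuning, alignment), inference optimization (e.g., attention mechanisms, decoding acceleration), specific application domains (e.g., AI for science), or advanced capabilities (e.g., chain of thought, world models).

!!! tip deepseek-chat TL;DR

The paper presents UniDial-EvalKit, a unified toolkit for standardized, efficient evaluation of multi-turn interactive AI systems. It addresses the heterogeneity of existing evaluation protocols, which makes systematic comparison difficult.

Abstract (Chinese translation)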

在多轮交互场景中对人工智能系统进行基准测试，对于理解其在真实应用中的实际能力至关重要。然而，现有的评估协议高度异质化，在数据集格式、模型接口和评估流程上存在显著差异，这严重阻碍了系统性的比较。在本工作中，我们提出了UniDial-EvalKit（UDE），一个用于评估交互式AI系统的统一评估工具包。UDE的核心贡献在于其整体统一性：它将异构数据格式标准化为通用模式，通过模块化架构简化复杂的评估流程，并在一致的评分接口下统一度量计算。它还通过并行生成与评分以及基于检查点的缓存来支持高效的大规模评估，从而消除冗余计算。在多个多轮基准测试上的验证表明，UDE不仅通过标准化工作流程和透明日志记录保证了高可复现性，还显著提升了评估效率和可扩展性。我们公开了完整的工具包和评估脚本，以促进标准化的基准测试生态，并加速交互式AI领域的未来突破。

Abstract

Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a consistent scoring interface. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint-based caching to eliminate redundant computation. Validated across diverse multi-turn benchmarks, UDE not only guarantees high reproducibility through standardized workflows and transparent logging, but also significantly improves evaluation efficiency and extensibility. We make the complete toolkit and evaluation scripts publicly available to foster a standardized benchmarking ecosystem and accelerate future breakthroughs in interactive AI.
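
Checkpoint-based caching, as described in the abstract, avoids recomputing results that already exist. The sketch below illustrates the general idea with an in-memory cache keyed by a hash of the model name and the dialogue. It is not UDE's API: every name in it (`ResultCache`, `get_or_compute`, `fake_generate`) is hypothetical, and a real toolkit would presumably persist results to disk.

```python
import hashlib
import json
from typing import Callable, Dict, List


class ResultCache:
    """In-memory stand-in for a checkpoint store keyed by (model, dialogue)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times generation actually ran

    @staticmethod
    def _key(model: str, dialogue: List[str]) -> str:
        payload = json.dumps({"model": model, "dialogue": dialogue}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, dialogue: List[str],
                       generate: Callable[[List[str]], str]) -> str:
        key = self._key(model, dialogue)
        if key not in self._store:
            self.misses += 1
            self._store[key] = generate(dialogue)
        return self._store[key]


def fake_generate(dialogue: List[str]) -> str:
    """Placeholder for a model call; echoes the last turn."""
    return "reply to: " + dialogue[-1]


cache = ResultCache()
first = cache.get_or_compute("demo-model", ["hi", "how are you?"], fake_generate)
second = cache.get_or_compute("demo-model", ["hi", "how are you?"], fake_generate)
```

The second call returns the stored reply without invoking the generator again, which is the redundancy the abstract says caching removes.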

Keywords: evaluation toolkit, multi-turn interactive scenarios, benchmarking AI systems, unified evaluation, standardized workflows, reproducibility, efficiency, interactive AI

119. ❌ When Language Models Lose Their Mind: The Consequences of Brain Misalignment

Authors: Gabriele Merlin, Mariya Toneva Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

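Scoring rationale: The paper's core topic is brain alignment in large language models (LLMs) and its effect on linguistic competence, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because it explicitly studies LLMs. It also treats brain alignment as a form of model alignment, so it was rated highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (10 points) on the grounds that brain alignment is one specific aspect of model alignment. The other keywords, such as MoE, SLMs, scaling laws, pre-training, RLHF, RAG, reasoning, agents, quantization, and AI for science, are either not mentioned in the abstract or unrelated to the paper's topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how brain alignment affects the linguistic competence of LLMs. By creating brain-misaligned models and evaluating them on more than 200 downstream tasks, it finds that brain alignment is critical for robust language understanding.

Abstract (Chinese translation)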

尽管脑对齐大语言模型因其作为认知模型的潜力以及在提升人工智能安全性与可信度方面的前景而备受关注，但此类脑对齐对语言能力的具体作用仍不明确。本研究通过引入脑失准模型——即那些在保持高水平语言建模性能的同时，被刻意训练为对脑活动预测能力较差的大语言模型——来探究脑对齐的功能性影响。我们在涵盖语义、句法、语篇、推理和形态学等多元语言领域的200余项下游任务上评估了这些模型。通过将脑失准模型与严格匹配的脑对齐模型进行对比，我们分离出脑对齐对语言理解的具体影响。实验结果表明，脑失准会显著损害下游任务表现，这凸显了脑对齐在实现稳健语言能力中的关键作用。这些发现强调了大语言模型中脑对齐的重要性，并为神经表征与语言处理之间的关系提供了新的见解。

Abstract

While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for potential for enhanced safety and trustworthiness in AI, the role of this brain alignment for linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models–LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.

Keywords: brain-aligned large language models, brain misalignment, linguistic competence, downstream tasks, neural representations, language understanding, cognitive models

120. ❌ HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature

Authors: Devvrat Joshi, Islem Rekik Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

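Scoring rationale: The paper studies automated knowledge graph generation from scientific literature, which falls within AI for Science and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). It mentions large language models (LLMs) as background and as a point of comparison, but its main contribution is a new framework (HGNet and Z-NERD) rather than LLM technology itself, so it is only partly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). The other keywords, such as MoE, SLMs, Scaling Laws, the training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agent systems, and model compression, are not addressed in the paper and score 0.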
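
To make the 0.0 total concrete, here is a minimal sketch of how the report's thresholds combine. The pass-mark formula (5.0 + 0.8 × total weight) and the total weight of 27.0 come from the report's configuration section. The per-keyword rule `score = weight × relevance` is an assumption made purely for illustration, since the report does not state how the per-keyword score is derived, and the function names are hypothetical.

```python
def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum the per-keyword scores of one paper.

    ASSUMPTION: each keyword contributes weight * relevance. The report does
    not document the exact rule; this is only an illustrative guess.
    """
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs for the two non-zero rows of the HGNet table;
# the other 25 rows have relevance 0.0 and contribute nothing.
hgnet_rows = [(0.0, 8.0), (0.0, 10.0)]

threshold = pass_mark(27.0)
total = paper_score(hgnet_rows)
print(f"pass mark = {threshold:.1f}, score = {total:.1f}, passes = {total >= threshold}")
```

Under this assumption, a weight of 0.0 on every row forces the total to 0.0 whatever the relevance ratings are, which is why a paper with two high relevance ratings still falls below 26.6.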

!!! tip deepseek-chat TL;DR

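The paper proposes HGNet, a scalable two-stage framework for automatically building hierarchical knowledge graphs from scientific literature. It addresses the weaknesses of existing methods in recognizing multi-word entities, generalizing across domains, and modeling the hierarchical structure of scientific knowledge, and it achieves state-of-the-art performance on several benchmarks.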

Abstract (Chinese translation)

自动化知识图谱构建对于驾驭快速增长的科学文献至关重要。然而，现有方法难以识别长的多词实体，通常无法跨领域泛化，且普遍忽视了科学知识的层次性。尽管通用大语言模型具有适应性，但其计算成本高昂，且在专业任务上的准确性不稳定。因此，当前的知识图谱往往浅层且不一致，限制了其在探索与综合中的应用价值。我们提出了一个可扩展的、零样本科学知识图谱构建的两阶段框架。第一阶段Z-NERD引入了：（i）正交语义分解，通过分离文本中的语义“转向”来促进领域无关的实体识别；（ii）一种多尺度TCQK注意力机制，通过具备n-元感知的注意力头来捕捉连贯的多词实体。第二阶段HGNet通过层次感知的消息传递进行关系抽取，显式建模父类、子类及同级关系。为确保全局一致性，我们引入了两个互补的目标函数：可微层次损失，用于抑制循环和捷径边；以及连续抽象场损失，它将抽象层次沿欧几里得空间中的可学习轴进行嵌入。这是首个将层次抽象形式化为标准欧几里得嵌入中连续属性的方法，为双曲几何方法提供了一种更简单的替代方案。我们发布了SPHERE（https://github.com/basiralab/SPHERE），一个用于层次关系抽取的多领域基准测试集。我们的框架在SciERC、SciER和SPHERE上取得了最新的最优性能，在分布外测试中，命名实体识别性能提升了8.08%，关系抽取性能提升了5.99%。在零样本设置下，命名实体识别的增益达到10.76%，关系抽取的增益达到26.2%。

Abstract

Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic “turns” in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (https://github.com/basiralab/SPHERE), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.

Keywords: knowledge graph generation, scientific literature, hierarchical relation extraction, zero-shot learning, entity recognition, foundation model, multi-scale attention, domain adaptation

121. ❌ PaperVoyager: Building Interactive Web with Visual Language Models

Authors: Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22999v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

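Scoring rationale: The paper proposes an agent that converts research papers into executable interactive web systems; its core involves autonomous agents driven by visual language models, tool use, and complex reasoning. It is highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow" (10 points; autonomous agents are the paper's core topic), "Tool Use/Function Calling/API Tool Use" (10 points; the agent uses tools), and "AI for Science/Bioinformatics/Cheminformatics" (10 points; applied to understanding scientific papers). It is partly relevant to "Large Language Models/LLMs/Foundation Models" (8 points; built on visual language models), "Chain of Thought/CoT Reasoning/Multi-step Reasoning" (8 points; involves complex reasoning), and "System 2 Thinking/Slow Thinking/In-depth Reasoning" (8 points; requires in-depth reasoning). Other keywords such as MoE, quantization, and alignment are unrelated to the paper and score 0.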

!!! tip deepseek-chat TL;DR

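The paper addresses the limitation that existing document agents can only convert papers into static content. It proposes PaperVoyager, an autonomous agent based on visual language models that automatically converts research papers into executable interactive web systems, and validates its effectiveness experimentally.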

Abstract (Chinese translation)

视觉语言模型的最新进展使得自主代理能够进行复杂推理、工具使用和文档理解。然而，现有的文档代理主要将论文转化为静态成果，如摘要、网页或幻灯片，这对于涉及动态机制与状态转换的技术论文而言是不够的。在本研究中，我们提出了一种“论文到交互系统代理”，能够将研究论文转化为可执行的交互式网页系统。给定一篇PDF论文，该代理无需人工干预即可完成端到端处理，包括论文理解、系统建模和交互式网页合成，使用户能够操作输入并观察动态行为。为评估此任务，我们引入了一个包含19篇研究论文的基准数据集，每篇均配有专家构建的交互式系统作为真实参照。我们进一步提出了PaperVoyager——一个结构化生成框架，在合成过程中显式建模机制与交互逻辑。实验表明，PaperVoyager显著提升了生成交互系统的质量，为交互式科学论文理解提供了新范式。

Abstract

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

Keywords: visual language models, autonomous agents, interactive web systems, paper understanding, tool use, complex reasoning, scientific papers, PaperVoyager

122. ❌ Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

Authors: Nils A. Herrmann, Tobias Eder, Jingyi He, Georg Groh Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

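Scoring rationale: The paper studies a fine-grained annotation scheme for multimodal content moderation that distinguishes uncivil from intolerant speech, and evaluates vision-language models under different annotation schemes. Its content concerns multimodal machine learning, content moderation, and annotation quality, and it does not touch the keyword areas of LLM technical principles, training methods, inference optimization, alignment techniques, agent systems, or model compression. The models it mentions (LLaVA-1.6-Mistral-7B, Qwen2.5-VL-7B) are multimodal vision-language models rather than text-only LLMs, and the paper does not discuss technical innovations or principles of these models. It emphasizes data quality improvement but does not engage with the Data Quality concept in the Scaling Laws sense. All keyword relevance scores are therefore 0.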

!!! tip deepseek-chat TL;DR

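The paper addresses the shortcomings of coarse-grained hate-speech labels in multimodal content moderation by proposing a fine-grained annotation scheme that distinguishes uncivil from intolerant speech. Experiments show that combining coarse and fine-grained labels improves the moderation performance of vision-language models and reduces under-detection of harmful content.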

Abstract (Chinese translation)

当前的多模态毒性基准通常使用单一的二元仇恨标签。这种粗粒度方法混淆了表达的两种根本不同特征：语气与内容。借鉴传播学理论，我们引入了一种细粒度标注方案，区分两个可分离的维度：不文明性（粗鲁或轻蔑的语气）与不宽容性（攻击多元主义并针对群体或身份的内容），并将其应用于来自Hateful Memes数据集的2,030个模因。我们评估了不同视觉-语言模型在粗标签训练、跨标签方案的迁移学习以及结合粗粒度仇恨标签与我们细粒度标注的联合学习方法下的表现。结果表明，细粒度标注能够补充现有的粗粒度标签，当联合使用时能提升模型的整体性能。此外，采用细粒度方案训练的模型展现出更平衡的内容审核相关错误分布，并且相较于仅使用仇恨标签训练的模型，更不易漏检有害内容（漏报率与误报率差值FNR-FPR：LLaVA-1.6-Mistral-7B从0.74降至0.42；Qwen2.5-VL-7B从0.54降至0.28）。本研究通过提升数据质量来增强审核系统的可靠性与准确性，为内容审核领域的数据中心化方法提供了贡献。总体而言，结合粗粒度与细粒度标签为构建更可靠的多模态内容审核系统提供了一条实用路径。

Abstract

Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions, incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities), and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes, and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.

Keywords: multimodal content moderation, fine-grained annotation, incivility, intolerance, vision-language models, hateful memes, data quality, false negative rate
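
The FNR-FPR gap quoted in the abstract is the false negative rate minus the false positive rate. A minimal sketch of that computation follows; the confusion counts are made-up numbers for illustration, not results from the paper, and the function name is hypothetical.

```python
def fnr_minus_fpr(tp: int, fp: int, tn: int, fn: int) -> float:
    """Return FNR - FPR for a binary classifier.

    FNR = FN / (FN + TP): share of harmful items that were missed.
    FPR = FP / (FP + TN): share of benign items that were flagged.
    A large positive gap means the model under-detects harmful content.
    """
    fnr = fn / (fn + tp)
    fpr = fp / (fp + tn)
    return fnr - fpr


# Hypothetical confusion counts (NOT taken from the paper).
gap = fnr_minus_fpr(tp=20, fp=5, tn=95, fn=80)
print(round(gap, 2))
```

A gap near zero indicates that misses and false alarms are balanced, which is the sense in which the abstract describes a drop from 0.74 to 0.42 as a more balanced error profile.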

123. ❌ Beyond Theoretical Bounds: Empirical Privacy Loss Calibration for Text Rewriting Under Local Differential Privacy

Authors: Weijun Li, Arnaud Grivet Sébert, Qiongkai Xu, Annabelle McIver, Mark Dras Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

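Scoring rationale: The paper studies privacy-preserving text rewriting in the context of large language models (LLMs), specifically text obfuscation mechanisms under Local Differential Privacy (LDP). Its core contribution is the TeDA framework, which empirically calibrates the privacy loss of different text rewriting mechanisms in order to evaluate privacy-utility trade-offs. It is therefore highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (8 points), because the paper explicitly concerns privacy-preserving data sharing for LLM use. The other keywords cover LLM technical principles (such as MoE, quantization, and inference acceleration) or specific application areas (such as AI for science), none of which the paper mentions or discusses, so they score 0.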

!!! tip deepseek-chat TL;DR

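The paper addresses privacy protection for text data sharing in LLM applications. It proposes TeDA, a hypothesis-testing-based framework that empirically calibrates the privacy loss of text rewriting mechanisms under Local Differential Privacy, giving a more comparable basis for evaluating privacy-utility trade-offs.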

Abstract (Chinese translation)

大型语言模型的日益广泛应用提升了以隐私保护方式共享文本数据的关注度。一项重要的研究方向通过本地差分隐私（Local Differential Privacy，LDP）下的文本重写来应对这一挑战，即在文本发布前对其进行本地混淆处理，并提供形式化的隐私保证。这些保证通常通过参数$\varepsilon$来表达，该参数限定了最坏情况下的隐私损失上界。然而，名义上的$\varepsilon$值往往难以解释，且在不同机制间难以直接比较。本研究探讨如何对LDP下的文本重写机制进行实证校准。我们提出了TeDA框架，该框架通过假设检验的形式实现校准，在表层空间和嵌入空间中实例化文本可区分性审计，从而能够从经验层面评估经隐私化处理的文本的不可区分性。将这一校准方法应用于多种代表性机制后，我们发现相似的名义$\varepsilon$界限可能对应着截然不同的可区分性水平。因此，实证校准为评估隐私与效用的权衡提供了更具可比性的基础，同时也为实际LDP文本重写部署中的机制比较与分析提供了实用工具。

Abstract

The growing use of large language models has increased interest in sharing textual data in a privacy-preserving manner. One prominent line of work addresses this challenge through text rewriting under Local Differential Privacy (LDP), where input texts are locally obfuscated before release with formal privacy guarantees. These guarantees are typically expressed by a parameter $\varepsilon$ that upper bounds the worst-case privacy loss. However, nominal $\varepsilon$ values are often difficult to interpret and compare across mechanisms. In this work, we investigate how to empirically calibrate across text rewriting mechanisms under LDP. We propose TeDA, which formulates calibration via a hypothesis-testing framework that instantiates text distinguishability audits in both surface and embedding spaces, enabling empirical assessment of indistinguishability from privatized texts. Applying this calibration to several representative mechanisms, we demonstrate that similar nominal $\varepsilon$ bounds can imply very different levels of distinguishability. Empirical calibration thus provides a more comparable footing for evaluating privacy-utility trade-offs, as well as a practical tool for mechanism comparison and analysis in real-world LDP text rewriting deployments.

Keywords: Large Language Models, Privacy-preserving Text Rewriting, Local Differential Privacy, Empirical Privacy Calibration, TeDA Framework, Privacy-Utility Trade-off, Text Distinguishability Audits
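
The abstract's point that similar nominal $\varepsilon$ values can hide different levels of distinguishability can be illustrated with the standard hypothesis-testing view of differential privacy. The sketch below is not TeDA's procedure, which the abstract does not spell out. It uses only the textbook fact that, for a pure $\varepsilon$-DP mechanism, any test separating two inputs must satisfy TPR ≤ e^ε · FPR and (1 − FPR) ≤ e^ε · (1 − TPR). The audit rates are hypothetical, and so is the function name.

```python
import math


def empirical_epsilon_lower_bound(tpr: float, fpr: float) -> float:
    """Lower bound on epsilon implied by one distinguishing test.

    For a pure epsilon-DP mechanism, every test that tries to tell two
    inputs apart satisfies TPR <= exp(eps) * FPR and
    (1 - FPR) <= exp(eps) * (1 - TPR). Rearranging gives the bound below.
    Both rates must lie strictly between 0 and 1.
    """
    return max(math.log(tpr / fpr), math.log((1 - fpr) / (1 - tpr)))


# Hypothetical audit outcomes for two mechanisms run at the same nominal epsilon.
weak_attack = empirical_epsilon_lower_bound(tpr=0.60, fpr=0.40)
strong_attack = empirical_epsilon_lower_bound(tpr=0.90, fpr=0.10)
print(round(weak_attack, 3), round(strong_attack, 3))
```

Observed rates carry sampling error, so a real audit would replace the point estimates with confidence bounds before reporting an empirical $\varepsilon$.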

124. ❌ Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion

Authors: Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22922v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

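Scoring rationale: The paper proposes the Cold-EQS framework, which uses large language models (LLMs) for e-commerce query suggestion; it is an application of LLMs to a specific domain. Because the paper explicitly uses LLMs, it is highly relevant to the "Large Language Models" keyword (8 points). It uses factuality as one of its reward signals, which gives it some relevance to "Hallucination Mitigation" (5 points). Other keywords such as MoE, SFT, RAG, and quantization are not addressed in the paper and score 0.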

!!! tip deepseek-chat TL;DR

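The paper targets cold-start e-commerce query suggestion and proposes Cold-EQS, an iterative reinforcement learning framework that optimizes suggested queries with intrinsic quality rewards, achieving a significant gain in user engagement in offline evaluation and online experiments.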

Abstract (Chinese translation)

现有对话系统依赖查询建议（Query Suggestion，QS）以提升用户参与度。近期研究通常采用结合点击率（Click-Through Rate，CTR）模型的大语言模型，但由于其高度依赖充足的在线点击数据以进行有效的CTR模型训练，在冷启动场景中往往失效。为弥补这一差距，我们提出Cold-EQS——一种面向冷启动电商查询建议（E-commerce Query Suggestion，EQS）的迭代强化学习框架。具体而言，我们以可回答性、事实性与信息增益作为奖励，持续优化建议查询的质量。为持续优化查询建议模型，我们对分组的候选建议查询进行不确定性估计，从而从缺乏点击信号的在线用户查询中筛选困难与模糊样本。此外，我们构建了一个包含16,949条在线用户查询的EQS基准数据集，用于离线训练与评估。大量离线与在线实验一致表明，在线与离线效果之间存在强正相关性。离线与在线实验结果均证明了Cold-EQS的优越性，其在在线聊天UV指标上实现了显著的+6.81%提升。

Abstract

Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with a Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as rewards to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.

Keywords: Cold-Start, E-commerce Query Suggestion, Iterative Reinforcement Learning, Large Language Models, Intrinsic Quality, Factuality, Answerability, Information Gain

125. ❌ Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

Authors: Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

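Scoring rationale: The paper's core is a multi-LLM ensemble translation method for creating a multilingual counseling dialogue dataset. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because the method relies entirely on combining several LLMs. Other keywords such as MoE, SLMs, Scaling Laws, the training techniques, reasoning methods, compression techniques, and AI for Science are not addressed or mentioned in the paper and score 0.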

!!! tip deepseek-chat TL;DR

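To address the scarcity of high-quality counseling dialogue datasets, the study develops a multi-LLM ensemble translation method, translates a Japanese counseling corpus into English and Chinese, and shows through human evaluation that the method's translations are better than those of any single state-of-the-art LLM.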

Abstract (Chinese translation)

为解决高质量、公开可用的心理咨询对话数据集严重匮乏的问题，我们通过将大规模人工撰写的日语心理咨询语料库KokoroChat翻译为英语和中文，创建了“多语言KokoroChat”。此过程中的一个关键挑战在于，最优的翻译模型会因输入内容而异，这使得任何单一模型都无法始终保证最高质量。在心理咨询这类敏感领域，尽可能高的翻译保真度至关重要，因此仅依赖单一大型语言模型是不够的。为克服这一挑战，我们开发并采用了一种新颖的多LLM集成方法。我们的方法首先从多个不同的大型语言模型生成多样化的翻译假设，随后由一个大型语言模型基于对所有呈现假设各自优缺点的分析，生成高质量的最终译文。我们通过人工偏好研究对“多语言KokoroChat”的质量进行了严格验证。这些评估证实，与我们方法集成的任何单一前沿大型语言模型相比，由集成方法产生的译文更受青睐。这一强烈偏好证实了我们方法输出结果的卓越质量。多语言KokoroChat可在 https://github.com/UEC-InabaLab/MultilingualKokoroChat 获取。

Abstract

To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred over those from any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method’s outputs. The Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.

Keywords: Multilingual Counseling Dataset, LLM Ensemble Translation, Multi-LLM Method, Translation Fidelity, Human Preference Evaluation, Dialogue Dataset Creation, Counseling Domain, Machine Translation
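
The two-step flow described in the abstract (several LLMs each produce a hypothesis, then a single LLM writes the final translation after weighing them) can be sketched structurally as follows. This is only an outline under stated assumptions: the "models" are stand-in Python functions so the example runs offline, the fuser is a trivial placeholder rather than an LLM, and every name here is hypothetical rather than taken from the paper or its repository.

```python
from typing import Callable, List

Translator = Callable[[str], str]        # source text -> candidate translation
Fuser = Callable[[str, List[str]], str]  # (source, candidates) -> final translation


def ensemble_translate(source: str, translators: List[Translator], fuser: Fuser) -> str:
    """Collect one hypothesis per translator, then let a single fuser produce the final text."""
    hypotheses = [translate(source) for translate in translators]
    return fuser(source, hypotheses)


# Stand-in "models": plain functions used only so the sketch runs offline.
def model_a(text: str) -> str:
    return f"[A] {text}"


def model_b(text: str) -> str:
    return f"[B] {text}"


def longest_hypothesis_fuser(source: str, hypotheses: List[str]) -> str:
    # Placeholder for the fusing LLM. A real fuser would compare the strengths
    # and weaknesses of the hypotheses; this one just returns the longest
    # candidate (the first one on ties).
    return max(hypotheses, key=len)


result = ensemble_translate("こんにちは", [model_a, model_b], longest_hypothesis_fuser)
print(result)
```

Keeping the translators and the fuser as interchangeable callables mirrors the abstract's observation that the best single model varies by input: the hypothesis pool can change without touching the fusion step.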

126. ❌ EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Qingfu Zhu, Wanxiang Che Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22910v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

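Scoring rationale: EchoKV addresses the KV cache memory bottleneck of large language models in long-context applications and proposes a KV cache compression scheme based on similarity-based reconstruction. The core relevant keywords are "KV Cache Compression OR Linear Attention OR FlashAttention" (15 points; the paper directly studies KV cache compression), "Large Language Models OR LLMs OR Foundation Models" (10 points; the paper explicitly targets LLMs), and "Context Window Extension OR Long Context LLMs" (10 points; the paper addresses memory for long-context applications). "Quantization OR Model Compression OR Low-bit Weights" and "Speculative Decoding OR Inference Acceleration" each receive 5 points because KV cache compression falls under model compression and inference acceleration. The other keywords have no direct connection to the paper and score 0.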

!!! tip deepseek-chat TL;DR

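The paper proposes EchoKV, a flexible KV cache compression scheme based on similarity-based reconstruction that addresses the memory bottleneck of large language models in long-context applications. Experiments show that it outperforms existing methods across a range of compression ratios while maintaining high throughput.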

Abstract (Chinese translation)

键值（KV）缓存的日益增长的内存需求对大型语言模型（LLMs）在长上下文应用中构成了显著瓶颈。现有的低秩压缩方法通常依赖于不可逆的参数变换，牺牲了在内存充足时切换回全精度推理的灵活性。本文提出EchoKV，一种灵活的KV缓存压缩方案，能够在标准推理与压缩推理之间按需切换。不同于传统的压缩-解压缩范式，EchoKV利用一个轻量级网络，从部分子集重建残差KV分量，该方法利用了注意力头之间固有的层内与层间相似性。我们进一步引入一种两阶段微调策略，支持快速、低成本的训练（例如，对于7B模型仅需约1个A100 GPU小时）。在LongBench和RULER上的实验结果表明，EchoKV在不同压缩率下均持续优于现有方法，同时在短上下文场景中保持高吞吐量。

Abstract

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.

Keywords: KV cache compression, Large Language Models, long-context applications, memory bottleneck, similarity-based reconstruction, attention heads, two-stage fine-tuning, throughput
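
To give a sense of the memory bottleneck the abstract refers to, the sketch below estimates the size of an uncompressed KV cache with the usual back-of-the-envelope formula: one key vector and one value vector per layer, per KV head, per token. The model configuration is a hypothetical 7B-class setup chosen for illustration and is not taken from the paper; nothing here implements EchoKV itself.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration: 32 layers, 32 KV heads, head_dim 128, fp16.
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Because the size grows linearly with sequence length, this assumed configuration goes from 2 GiB at 4,096 tokens to 64 GiB at 131,072 tokens for a single sequence, which is the kind of growth that motivates cache compression.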

127. ❌ The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration

Authors: Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang, Tao He, Xiangyu Liu, Zixiang Wang, Jiafeng Liang, Zheng Chu, Runxuan Liu, Rongchuan Mu, Ming Liu, Bing Qin Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

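Scoring rationale: The paper's core topic is tool use in LLM agents and its evolution from single-tool calls to multi-tool orchestration. The highly relevant keywords are "LLM Agents", "Tool Use", and "Large Language Models". "Multi-agent Systems" is partly relevant because multi-tool orchestration involves coordination mechanisms. Other keywords such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications are not mentioned in the abstract and score 0.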

!!! tip deepseek-chat TL;DR

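The paper is a systematic survey of how tool use in LLM agents has evolved from single-tool calls to multi-tool orchestration. It analyzes research progress along six core dimensions, including planning and execution, training and trajectory construction, and safety and control, and summarizes applications in software engineering, enterprise workflows, and other areas.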

Abstract (Chinese translation)

工具使用使大型语言模型（LLM）能够访问外部信息、调用软件系统，并在数字环境中执行操作，从而超越仅凭模型参数所能解决的问题范围。早期研究主要关注模型能否选择并执行正确的单一工具调用。然而，随着智能体系统的发展，核心问题已从孤立的工具调用转向在长轨迹中进行多工具协同编排，这一过程涉及中间状态、执行反馈、动态变化的环境以及安全性、成本和可验证性等实际约束。本文全面回顾了多工具LLM智能体的最新进展，并对这一快速发展领域的现状进行了分析。首先，我们统一了任务描述框架，区分了单次调用工具使用与长视野协同编排。随后，我们围绕六个核心维度对现有文献进行了梳理：推理时规划与执行、训练与轨迹构建、安全与控制、资源约束下的效率、开放环境中的能力完备性，以及基准设计与评估。我们进一步总结了在软件工程、企业工作流、图形用户界面和移动系统等领域的代表性应用。最后，我们讨论了构建可靠、可扩展且可验证的多工具智能体所面临的主要挑战，并展望了未来的研究方向。

Abstract

Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyze the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.

Keywords: LLM Agents, Tool Use, Multi-tool Orchestration, Agent Systems, Long-horizon Tasks, Software Engineering, Enterprise Workflows, Benchmark Evaluation

128. ❌ RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings

Authors: Sitong Zhou, Meliha Yetisgen, Mari Ostendorf Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

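Scoring rationale: The paper uses LLMs for timeline summarization of radiology reports, an application of large models in the biomedical domain. It is highly relevant to 'Large Language Models' (10) because it explicitly uses a three-step LLM process, and to 'AI for Science' (10) because it applies large models to radiology (related to bioinformatics). Other keywords such as MoE, SFT, and RAG are not covered and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes using large language models (LLMs) to automatically generate timeline summaries of lung findings in radiology reports. A three-step LLM process (extracting findings, generating group names, grouping the findings) achieves grouping performance comparable to human annotators on the RadTimeline dataset.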

Abstract translation (Chinese)

在纵向放射学报告中追踪影像发现对于准确识别疾病进展至关重要，而自动摘要技术可显著优化这一耗时过程。本研究提出了一种结构化摘要任务，将纵向报告摘要构建为时间线生成任务：按时间排列的发现结果以纵向列呈现，而具有时间关联性的发现则被分组为横向行。这种结构化摘要格式支持跨时间点的发现结果直观对比，并便于依据原始报告进行事实核查。时间线通过三步大语言模型流程生成：提取发现结果、生成分组名称、并利用这些名称对发现进行归类。为评估此类系统，我们构建了RadTimeline数据集，该数据集专注于追踪胸部影像报告中肺部相关的放射学发现。在RadTimeline上的实验揭示了不同规模大语言模型及提示策略的效能权衡。结果表明，生成分组名称作为中间步骤对实现有效的发现归类至关重要。最优配置方案虽存在少量无关发现，但召回率表现优异，其分组性能可与人工标注者相媲美。

Abstract

Tracking findings in longitudinal radiology reports is crucial for accurately identifying disease progression, and the time-consuming process would benefit from automatic summarization. This work introduces a structured summarization task, where we frame longitudinal report summarization as a timeline generation task, with dated findings organized in columns and temporally related findings grouped in rows. This structured summarization format enables straightforward comparison of findings across time and facilitates fact-checking against the associated reports. The timeline is generated using a 3-step LLM process of extracting findings, generating group names, and using the names to group the findings. To evaluate such systems, we create RadTimeline, a timeline dataset focused on tracking lung-related radiologic findings in chest-related imaging reports. Experiments on RadTimeline show tradeoffs of different-sized LLMs and prompting strategies. Our results highlight that group name generation as an intermediate step is critical for effective finding grouping. The best configuration has some irrelevant findings but very good recall, and grouping performance is comparable to human annotators.
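
The timeline layout described in the abstract (dated findings in columns, temporally related findings grouped in rows) can be illustrated with a small sketch. It assumes the three LLM steps have already run and produced (date, group name, finding) triples. The `Finding` type, the `build_timeline` helper, and the sample data are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One extracted finding; here it stands in for the output of the LLM steps."""
    report_date: str   # ISO date of the source report
    group: str         # group name assigned during grouping (hypothetical label)
    text: str          # finding text extracted from the report


def build_timeline(findings):
    """Arrange findings as rows (groups) by columns (report dates)."""
    dates = sorted({f.report_date for f in findings})  # ISO dates sort as text
    cells = defaultdict(list)
    for f in findings:
        cells[(f.group, f.report_date)].append(f.text)
    groups = sorted({f.group for f in findings})
    # Each row holds one cell per date; an empty cell means "not mentioned".
    return dates, {
        g: ["; ".join(cells[(g, d)]) for d in dates] for g in groups
    }


# Illustrative data only, not drawn from the RadTimeline dataset.
sample = [
    Finding("2024-01-10", "right upper lobe nodule", "4 mm nodule"),
    Finding("2024-06-02", "right upper lobe nodule", "nodule now 6 mm"),
    Finding("2024-06-02", "pleural effusion", "small left effusion"),
]
dates, rows = build_timeline(sample)
for group, row in rows.items():
    print(group, "|", " | ".join(cell or "-" for cell in row))
```

Keeping one cell per (group, date) pair is what makes the cross-time comparison and fact-checking against the source report straightforward: each cell traces back to a single dated report.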

Keywords: timeline summarization, longitudinal radiology reports, lung findings, LLM process, RadTimeline dataset, structured summarization, chest imaging, finding grouping

129. ❌ Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration

Authors: Qiyao Sun, Xingming Li, Xixiang He, Ao Cheng, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

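Scoring rationale: The paper addresses hallucination detection in LLMs, which directly matches the keywords 'Large Language Models' and 'Hallucination Mitigation'; each is scored 10. Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not mentioned in the abstract or are unrelated, and are scored 0.

!!! tip deepseek-chat TL;DR

To address factually incorrect outputs (hallucinations) from large language models, the paper proposes an adaptive Bayesian framework for estimating semantic entropy. By dynamically adjusting the number of samples and guiding semantic exploration, it substantially improves computational efficiency while maintaining detection performance.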

Abstract translation (Chinese)

大型语言模型（LLM）在各种自然语言处理任务中取得了显著成功，但仍易于生成事实错误的输出，即幻觉现象。尽管近期研究通过从LLM中重复采样并量化生成响应间的语义不一致性，在幻觉检测方面展现出潜力，但这些方法依赖于固定的采样预算，无法适应查询的复杂性，导致计算效率低下。我们提出一种基于引导语义探索的自适应贝叶斯语义熵估计框架，能够根据观测到的不确定性动态调整采样需求。该方法采用分层贝叶斯框架对语义分布进行建模，通过基于方差的阈值实现采样迭代的动态控制——一旦达到足够的确定性即终止生成过程。我们还开发了一种基于扰动的重要性采样策略，以系统性地探索语义空间。在四个问答数据集上的大量实验表明，我们的方法以显著提升的效率实现了更优的幻觉检测性能。在低预算场景下，本方法仅需约50%的样本量即可达到与现有方法相当的检测性能，而在相同采样预算下平均AUROC指标提升了12.6%。

Abstract

Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance-based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation-based importance sampling strategy to systematically explore the semantic space. Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low-budget scenarios, our approach requires about 50% fewer samples to achieve comparable detection performance to existing methods, while delivering an average AUROC improvement of 12.6% under the same sampling budget.
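
For context, the fixed-budget baseline that the abstract contrasts against scores a query by how much its sampled answers disagree in meaning. The sketch below computes a discrete semantic entropy from cluster labels. It is not the paper's adaptive Bayesian estimator, and the `semantic_entropy` helper and the hard-coded cluster labels (which in practice would come from a separate meaning-clustering step) are assumptions for illustration.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster ids for 8 sampled answers to one question.
consistent = ["A"] * 8                 # every sample lands in one cluster
scattered = ["A", "B", "C", "D"] * 2   # samples spread over four clusters

print(semantic_entropy(consistent))  # full agreement: zero entropy
print(semantic_entropy(scattered))   # disagreement: higher entropy
```

With a fixed budget, every query pays for all eight samples even when the first few already agree, which is the inefficiency the adaptive stopping rule in the abstract targets.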

Keywords: Hallucination Detection, Large Language Models, Semantic Entropy, Adaptive Sampling, Bayesian Estimation, Computational Efficiency, QA Datasets, Semantic Inconsistency

130. ❌ Span Modeling for Idiomaticity and Figurative Language Detection with Span Contrastive Loss

Authors: Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

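Scoring rationale: The paper studies fine-tuning (SFT) of BERT- and RoBERTa-based models for idiom detection, which touches on the use of large language models (LLMs) in natural language processing. It does not cover other keywords such as MoE, quantization, inference acceleration, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper proposes BERT- and RoBERTa-based models fine-tuned with a combination of slot loss and span contrastive loss for idiom and figurative language detection, achieving state-of-the-art sequence accuracy on existing datasets.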

Abstract translation (Chinese)

比喻语言范畴包含多种类型，其中部分在本质上具有非组合性。这类短语或多词表达包含习语，其含义并非构成词汇意义的简单叠加。对于语言模型而言，由于分词机制与相邻上下文嵌入的影响，这构成了独特挑战。尽管许多大语言模型通过构建大规模短语词汇表已克服此问题，但在缺乏单样本/少样本提示或指令微调的情况下，模型往往无法实现即时识别。当前最佳成果主要通过基于BERT或LSTM的微调方法实现。本文提出的模型即包含此类变体。我们提出基于BERT与RoBERTa的微调模型，结合槽位损失与跨度对比损失（SCL）并采用困难负样本重加权策略，以提升习语性检测能力，在现有数据集上取得了序列准确率的先进性能。对比消融实验证明了SCL的有效性及其泛化能力。本文同时提出采用F1值与序列准确率（SA）的几何平均数，以综合评估模型的跨度感知能力与整体性能。

Abstract

The category of figurative language contains many varieties, some of which are non-compositional in nature. This type of phrase or multi-word expression (MWE) includes idioms, which represent a single meaning that does not consist of the sum of its words. For language models, this presents a unique problem due to tokenization and adjacent contextual embeddings. Many large language models have overcome this issue with large phrase vocabulary, though immediate recognition frequently fails without one- or few-shot prompting or instruction finetuning. The best results have been achieved with BERT-based or LSTM finetuning approaches. The model in this paper contains one such variety. We propose BERT- and RoBERTa-based models finetuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting to improve idiomaticity detection, attaining state of the art sequence accuracy performance on existing datasets. Comparative ablation studies show the effectiveness of SCL and its generalizability. The geometric mean of F1 and sequence accuracy (SA) is also proposed to assess a model’s span awareness and general performance together.
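
The combined metric proposed in the last sentence of the abstract is the geometric mean of F1 and sequence accuracy (SA). A minimal sketch, assuming both scores are already on a 0-1 scale; the function name and the example values are illustrative and are not results from the paper.

```python
import math


def f1_sa_geometric_mean(f1: float, sequence_accuracy: float) -> float:
    """Geometric mean of F1 and sequence accuracy (both expected in [0, 1])."""
    if not (0.0 <= f1 <= 1.0 and 0.0 <= sequence_accuracy <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return math.sqrt(f1 * sequence_accuracy)


# Illustrative values only: strong token-level F1 with weak whole-sequence
# accuracy is penalized more than it would be by an arithmetic mean.
print(f1_sa_geometric_mean(0.81, 0.49))
```

Because a geometric mean is never larger than the arithmetic mean and drops to zero when either input is zero, a model cannot score well on the combined metric by doing well on only one of the two.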

Keywords: figurative language detection, idiomaticity detection, BERT fine-tuning, span contrastive loss, sequence accuracy, multi-word expression, hard negative reweighting, state-of-the-art performance

131. ❌ Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

Authors: Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

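Scoring rationale: The paper evaluates the ability of LLM agents to carry out observational studies on medical databases, so it is highly relevant to 'LLM Agents' and 'AI for Science' (10 each). It involves multi-step reasoning and deliberate, System 2 style thinking (5 each), requires tool use (5), and is concerned with the truthfulness of the evidence produced (5). Other keywords covering model architecture, training methods, and optimization techniques are not addressed (0).

!!! tip deepseek-chat TL;DR

The paper examines whether LLM agents can carry out end-to-end observational studies on medical databases. It finds that current models have low success rates at producing structured evidence bundles (39.9% for the best agent) and that the choice of agent scaffold substantially affects performance.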

Abstract translation (Chinese)

观察性研究能够大规模产生具有临床可操作性的证据，但在真实世界数据库中执行这类研究是开放性的，需要在队列构建、分析和报告等环节做出连贯决策。先前对大型语言模型智能体的评估侧重于孤立步骤或单一答案，忽略了最终证据集合的完整性与内部结构。为弥补这一不足，我们推出了RWE-bench——一个基于MIMIC-IV数据库并源自同行评审观察性研究的基准测试。每个任务均提供相应的研究方案作为参考标准，要求智能体在真实数据库中执行实验并迭代生成树状结构的证据集合。我们使用三种智能体框架，从问题级准确性和端到端任务指标两个维度评估了六种大型语言模型（三种开源模型、三种闭源模型）。在162项任务中，任务成功率普遍较低：最优智能体仅达39.9%，最优开源模型为30.4%。智能体框架的选择也产生显著影响，导致性能指标波动超过30%。此外，我们实现了自动化队列评估方法，以快速定位错误并识别智能体故障模式。总体而言，结果凸显了智能体在生成端到端证据集合能力方面存在的持续局限，而高效验证仍是未来工作的重要方向。代码与数据详见https://github.com/somewordstoolate/RWE-bench。

Abstract

Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents’ ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.

Keywords: LLM Agents, Observational Studies, Medical Databases, Real-World Evidence, Benchmark Evaluation, Cohort Construction, Evidence Bundles, MIMIC-IV

132. ❌ DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona

Authors: Janghyeok Choi, Jaewon Lee, Sungzoon Cho Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

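Scoring rationale: The paper uses LLMs for data augmentation in the legal domain, so it is highly relevant to 'Large Language Models' (10). It fine-tunes retrievers on the augmented data, which is partially relevant to 'Post-training/SFT' (5). Its application to legal information retrieval is partially relevant to 'Retrieval-Augmented Generation/RAG' (5). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and PEFT are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

To address data scarcity in the legal domain, the paper proposes DALDALL, a data augmentation framework based on LLM personas. It generates synthetic queries with greater lexical and semantic diversity, and dense retrievers fine-tuned on the augmented data achieve competitive or superior recall in legal information retrieval.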

Abstract translation (Chinese)

数据稀缺性在低资源领域始终是一个持续存在的挑战。尽管现有的数据增强方法利用大语言模型（LLM）的生成能力来产生大量合成数据，但这些方法往往重数量而轻质量，且缺乏针对特定领域的策略。在本研究中，我们提出了DALDALL，一个为法律信息检索（IR）量身定制的、基于人物角色的数据增强框架。我们的方法采用领域特定的专业角色——例如律师、检察官和法官——来生成合成查询，这些查询在词汇和语义多样性上显著优于普通提示方法。在CLERC和COLIEE基准测试上的实验表明，基于角色的增强方法在通过Self-BLEU分数衡量的词汇多样性方面取得了提升，同时保持了与原始查询的语义保真度。此外，与使用原始数据或通用增强数据训练的密集检索器相比，在基于角色增强数据上微调的密集检索器始终取得具有竞争力或更优的召回性能。这些发现确立了基于角色的提示作为一种在专业化、低资源领域中生成高质量训练数据的有效策略。

Abstract

Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas–such as attorneys, prosecutors, and judges–to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation achieves improvement in lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.

Keywords: Data Augmentation, Large Language Models, Legal Domain, Information Retrieval, Persona-based Prompting, Synthetic Data, Lexical Diversity, Semantic Diversity

133. ❌ KALAVAI: Predicting When Independent Specialist Fusion Works – A Quantitative Model for Post-Hoc Cooperative LLM Training

Authors: Ramchand Kumaresan Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

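Scoring rationale: The paper studies a post-hoc fusion training method for LLMs (the KALAVAI protocol) involving MoE routing, model fusion, and fine-tuning, so it is highly relevant to 'Large Language Models', 'Mixture of Experts', 'Post-training', and 'Model Merging' (10 each). Other keywords such as 'Small Language Models', 'Scaling Laws', and 'Instruction Tuning' are not mentioned in the abstract or are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper proposes the KALAVAI protocol, which uses a quantitative model to predict the cooperative gain from post-hoc fusion of independently trained domain-specialist LLMs, and shows that lightweight MoE routing can effectively fuse multiple specialists and improve performance.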

Abstract translation (Chinese)

独立训练的领域专家模型可通过事后融合形成一个单一模型，其性能优于任何单一专家模型，且增益可预测：增益 = 0.82 × 差异度 - 2.72（R^2 = 0.856，n=6，差异度范围3-26%）。这使得实践者能够在投入计算资源前预估协作价值。当差异度低于约3.3%时，增益趋近于零。在KALAVAI协议中，贡献者基于共享检查点独立微调模型副本，随后提交进行轻量级混合专家（MoE）路由训练（500步）。增益结果稳定：在4.1亿参数规模上提升+7.72%（±0.02%，3次随机种子），在10亿参数规模上提升+7.49%（±0.01%，3次随机种子），在69亿参数规模上提升+6.53%，均优于最佳专家模型。路由器的匹配精度与领域先知路由的差异小于10^{-5}纳特。跨语言融合（泰米尔语/约鲁巴语/威尔士语/代码）实现+21.76%增益，其中约鲁巴语困惑度从41.9降至7.7。20位贡献者的联合训练实现+16.71%增益（±0.07个百分点，3次随机种子）。
该协议受三项条件约束：共享初始化是必要的——检查点不匹配会降低路由质量；在约10,000训练步以内冻结网络层是可选的，超过该步数则有益；学习型路由至关重要——均匀平均策略相比最佳专家模型会降低1.2%性能，而任何经过训练的路由器均可实现先知最优分配。

Abstract

Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 x divergence - 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero. In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (+/-0.02%, 3 seeds), +7.49% at 1B (+/-0.01%, 3 seeds), +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing within <10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling from 41.9 to 7.7. A 20-contributor federation achieves +16.71% (+/-0.07pp, 3 seeds). Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades by -1.2% vs. best specialist, while any trained router achieves oracle-optimal assignment.
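
The linear fit quoted in the abstract can be evaluated directly. The sketch below plugs divergence values into gain = 0.82 x divergence - 2.72 and solves for the break-even point. The function names are hypothetical, both quantities are in percent as in the abstract, and the fit is reported only for divergences of 3-26% with n=6, so values outside that range are extrapolations.

```python
SLOPE = 0.82       # coefficient from the fit quoted in the abstract
INTERCEPT = -2.72  # intercept from the fit quoted in the abstract


def predicted_gain(divergence_pct: float) -> float:
    """Predicted fusion gain (%) over the best specialist."""
    return SLOPE * divergence_pct + INTERCEPT


def break_even_divergence() -> float:
    """Divergence (%) at which the fitted gain crosses zero."""
    return -INTERCEPT / SLOPE


print(round(break_even_divergence(), 2))
for d in (5.0, 15.0, 26.0):
    print(d, round(predicted_gain(d), 2))
```

The zero crossing of the fitted line falls at roughly 3.3% divergence, which is consistent with the abstract's statement that gains approach zero below that level.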

Keywords: LLM fusion, Mixture of Experts, post-hoc training, model merging, domain specialists, cooperative training, routing optimization, perplexity reduction

134. ❌ PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation

Authors: Ruidi Chang, Jiawei Zhou, Hanjie Chen Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

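Scoring rationale: PRISM analyzes the multi-step reasoning process of LLMs and their reasoning mechanisms, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 each). Its in-depth analysis of the reasoning process is related to 'System 2 Thinking' and 'Mechanistic Interpretability' (8 each). The paper's analysis of failed trajectories (for example, overthinking) is indirectly related to 'Self-Correction' (5). The other keywords (MoE, SFT, RAG, quantization, and so on) concern model architecture, training methods, applications, or optimization techniques that the paper does not directly address, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces the PRISM framework for jointly analyzing the text sequence and the internal hidden states of LLM reasoning. It reveals patterns in failed reasoning (such as verification loops and overthinking) and shows how prompting changes reasoning behavior.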

Abstract translation (Chinese)

大语言模型通过生成多步推理轨迹来解决复杂问题。然而，这些轨迹通常仅从两个视角之一进行分析：要么关注生成文本中不同推理步骤间的标记序列，要么关注单个步骤内模型各层的隐藏状态向量。我们提出了PRISM（通过语义与隐式建模进行概率推理检测），这是一个用于联合分析这两个层面的框架与诊断工具，它提供了关于推理如何在步骤与层间演化的统一视图。在多个推理模型与基准测试中，PRISM揭示了推理过程中的系统性模式，表明失败的轨迹更可能陷入无效的验证循环，并进一步分化为不同的模式，如过度思考与过早承诺，这些模式在得出候选答案后表现出不同的行为。它还进一步揭示了提示技术如何通过改变语义转换与内部计算模式，从而重塑推理行为，而不仅仅是影响总体准确率。通过将推理轨迹建模为结构化过程，PRISM使得这些行为变得可观察、可分析，而非仅仅依赖最终任务准确率进行评估。综上所述，这些见解使PRISM成为一个用于分析和诊断大语言模型推理过程的实用工具。

Abstract

Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden-state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final-task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.

Keywords: LLM reasoning, multi-step reasoning, reasoning trajectories, semantic flow, latent computation, reasoning diagnosis, PRISM framework, internal computational patterns

135. ❌ Explanation Generation for Contradiction Reconciliation with LLMs

Authors: Jason Chan, Zhixue Zhao, Robert Gaizauskas Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

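Scoring rationale: The paper studies the reasoning ability of LLMs on the task of generating explanations that reconcile contradictions. It bears directly on the keywords for LLMs, reasoning (CoT, System 2 Thinking), and explanation generation (Explainable AI), which receive high relevance. It has some connection to hallucination mitigation and self-correction. Other techniques such as MoE, quantization, and RAG are not covered.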
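
The score table above can be read as a weighted sum. The following is a minimal sketch, assuming each keyword's score is its weight times its relevance; the report does not state the per-keyword formula, so that part is an assumption. The pass mark uses the formula from the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names `pass_mark` and `paper_score` are hypothetical.

```python
import math


def pass_mark(total_weight: float) -> float:
    """Pass mark from the report's configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight * relevance over keywords (assumed formula, not confirmed by the report)."""
    return sum(weight * relevance for weight, relevance in rows)


# Rows with non-zero relevance in this paper's table: (weight, relevance out of 10).
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 8.0), (0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]

threshold = pass_mark(27.0)
total = paper_score(rows)
print(f"score={total:.1f} threshold={threshold:.1f} passed={total >= threshold}")
```

Because every weight in the table is 0.0, any multiplicative scoring rule yields a total of 0.0, which is below the 26.6 pass mark regardless of the relevance values.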

!!! tip deepseek-chat TL;DR

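The paper studies how well large language models (LLMs) generate explanations that reconcile contradictory statements. It finds that most models have limited success on the task and that the benefit of additional test-time compute plateaus as model size increases.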

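Abstract translation

Existing natural language processing research usually treats contradictions as errors to be resolved by choosing which statements to accept or discard. In social interaction and professional domains, however, a key human reasoning ability is proposing explanatory hypotheses that reconcile contradictions. For example, “Cassie hates coffee” and “She buys coffee every day” seem contradictory, but the two are compatible if Cassie has the daily chore of buying coffee for all her coworkers. Although the reasoning abilities of large language models (LLMs) keep growing, their ability to propose such reconciliatory explanations remains largely unexplored. To fill this gap, we introduce the task of reconciliatory explanation generation, which requires models to generate explanations that effectively resolve contradictory statements. We propose a new method for repurposing existing natural language inference (NLI) datasets and introduce quality metrics that support scalable automatic evaluation. Experiments on 18 LLMs show that most models have limited success on this task, and that the benefit of extending test-time compute through “thinking” plateaus as model size increases. Our findings reveal an under-explored dimension of LLM reasoning and point to the need to address this limitation when improving LLMs for downstream applications such as chatbots and scientific assistants.

Abstract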

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, “Cassie hates coffee” and “She buys coffee every day” may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by “thinking” plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs’ downstream applications such as chatbots and scientific aids.

Keywords: Large Language Models, Contradiction Reconciliation, Explanation Generation, Reasoning Capabilities, Natural Language Inference, Automatic Evaluation, Model Size, Test-time Compute

136. ❌ Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

Authors: Naohiro Tawara, Samuele Cornell, Alexander Polok, Marc Delcroix, Lukáš Burget, Shinji Watanabe Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

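Scoring rationale: The paper evaluates LLM-based conversational automatic speech recognition systems in overlapping-speech, multi-speaker settings. It explicitly mentions "LLM-based systems", so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not address the specific techniques behind the other keywords (MoE, SFT, RAG, quantization, and so on) or AI-for-science applications, so all other keywords score 0.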

!!! tip deepseek-chat TL;DR

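The paper evaluates the robustness of LLM-based conversational automatic speech recognition systems under multi-speaker overlapping speech. LLM-based systems are competitive with two speakers but degrade as speaker count and overlap increase, whereas modular pipelines are more robust.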

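Abstract translation

Conversational automatic speech recognition remains challenging because of overlapping speech, far-field noise, and varying numbers of speakers. Although recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based approaches with modular pipeline approaches along four dimensions: robustness to overlapping speech, semantic fidelity, number of speakers, and single-channel versus multi-channel input. To capture meaning-altering errors that conventional metrics overlook, we propose tcpSemER, which extends the tcpWER metric by replacing edit distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments on three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as the number of speakers and the overlap ratio increase, whereas modular pipelines remain more robust.

Abstract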

Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.

Keywords: Conversational ASR, Large Language Models, Overlapping Speech, Multi-speaker, Semantic Evaluation, tcpSemER, Robustness, Speaker Count
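
The abstract describes tcpSemER as swapping Levenshtein distance for embedding-based semantic similarity. The sketch below contrasts the two ideas on a single utterance. It is not the paper's metric: it ignores the time constraints and speaker permutations that tcpWER handles, the exact tcpSemER formula is not given in the abstract, and `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence encoder.

```python
import math
from collections import Counter


def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / len(ref_words)


def toy_embed(text: str) -> Counter:
    """Placeholder embedding (bag-of-words counts); a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def semantic_error(ref: str, hyp: str, embed=toy_embed) -> float:
    """One minus the cosine similarity between the two embeddings."""
    a, b = embed(ref), embed(hyp)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


ref = "turn the heating off"
hyp = "turn the heating on"
print(wer(ref, hyp), semantic_error(ref, hyp))
```

With a bag-of-words stand-in the two numbers track each other closely. The motivation in the abstract only shows up with a real sentence encoder, where a one-word substitution that flips the meaning (such as "off" to "on") can be penalized more heavily than a harmless one.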

137. ❌ How Utilitarian Are OpenAI’s Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025)

Authors: Johannes Himmelreich Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

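Scoring rationale: The paper studies methods for evaluating the moral reasoning of LLMs and bears directly on the keywords for LLMs, value alignment, chain of thought, and in-depth reasoning. It evaluates OpenAI models on moral dilemmas with multiple prompt variants and finds single-prompt evaluation unreliable, which is highly relevant to LLM behavior evaluation, alignment, and reasoning. Other keywords such as MoE, quantization, and RAG concern model architecture, optimization, or specific applications that the paper does not address, so they score 0.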

!!! tip deepseek-chat TL;DR

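The paper finds that single-prompt evaluation of LLM reasoning on moral dilemmas is unreliable. Multi-prompt testing shows that the apparent utilitarian tendency of OpenAI models depends heavily on prompt framing, and the paper recommends multi-prompt robustness testing as standard practice for evaluating LLM behavior.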

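Abstract translation

Pfeffer, Krügel, and Uhl (2025) report that OpenAI's reasoning model o1-mini gives more utilitarian responses to the trolley problem and the footbridge dilemma than the non-reasoning model GPT-4o. This study replicates their experiment, extends it to four current OpenAI models, and adds tests of prompt variants. The trolley-problem finding does not replicate: GPT-4o's low rate of utilitarian responses stems not from a deontological stance but from safety refusals triggered by the advisory framing of the prompt. When the question is reframed from “Should I…?” to “Is it morally permissible…?”, GPT-4o's utilitarian response rate reaches 99%. Once prompt confounds are removed, all models converge on utilitarian answers. The footbridge finding replicates, but with blemishes: across prompt variants, reasoning models are still generally more inclined than non-reasoning models to give utilitarian responses, but they often refuse to answer the dilemma or, when they do answer, give a non-utilitarian rather than a utilitarian answer. These results show that evaluating LLM moral reasoning with a single prompt is unreliable: any empirical claim about LLM behavior should use multi-prompt robustness testing as standard practice.

Abstract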

Pfeffer, Krügel, and Uhl (2025) report that OpenAI’s reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o’s low utilitarian rate doesn’t reflect a deontological commitment but safety refusals triggered by the prompt’s advisory framing. When framed as “Is it morally permissible…?” instead of “Should I…?”, GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.

Keywords: Large Language Models, Moral Reasoning, Utilitarianism, Prompt Variants, Evaluation Reliability, Trolley Problem, Reasoning Models, Multi-prompt Testing
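
The multi-prompt robustness testing recommended in the abstract comes down to tallying response categories per prompt variant and checking how much the rates move. A minimal sketch follows, assuming responses have already been labelled as `utilitarian`, `non_utilitarian`, or `refusal`. The label names, variant names, and data are hypothetical illustrations, not the paper's coding scheme or results.

```python
from collections import Counter

CATEGORIES = ("utilitarian", "non_utilitarian", "refusal")


def response_rates(labels: list[str]) -> dict[str, float]:
    """Share of responses falling into each category for one prompt variant."""
    counts = Counter(labels)
    return {category: counts[category] / len(labels) for category in CATEGORIES}


def utilitarian_spread(results: dict[str, list[str]]) -> float:
    """Largest gap in utilitarian rate across prompt variants (0.0 means fully robust)."""
    rates = [response_rates(labels)["utilitarian"] for labels in results.values()]
    return max(rates) - min(rates)


# Made-up labelled responses per prompt variant, for illustration only.
results = {
    "should_i": ["refusal", "refusal", "utilitarian", "non_utilitarian"],
    "is_it_permissible": ["utilitarian", "utilitarian", "utilitarian", "utilitarian"],
}

for variant, labels in results.items():
    print(variant, response_rates(labels))
print("spread:", utilitarian_spread(results))
```

Keeping refusals as their own category matters here: the abstract attributes GPT-4o's low utilitarian rate to safety refusals, which a two-way utilitarian versus non-utilitarian tally would hide.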

138. ❌ Detecting Non-Membership in LLM Training Data via Rank Correlations

Authors: Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

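Scoring rationale: The paper focuses on detecting non-membership in LLM training data (non-membership inference). Its core contribution is the PRISM method, which uses rank correlations of model logits to verify that a specific dataset was excluded from LLM training. It is therefore highly relevant only to the keyword "Large Language Models OR LLMs OR Foundation Models" (10 points), because it directly studies auditing of LLM training data. The other keywords concern model architecture, training techniques, inference optimization, and application domains that the paper does not address, so they all score 0.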

!!! tip deepseek-chat TL;DR

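The paper proposes PRISM, which analyzes rank correlations of LLM token log probabilities to detect that a dataset was not used for LLM training (non-membership inference), providing a framework for copyright compliance and trust verification.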

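Abstract translation

As large language models (LLMs) are trained on increasingly large and opaque text corpora, determining which data took part in training has become a key issue for copyright enforcement, compliance auditing, and user trust. Prior work has focused mainly on detecting whether a dataset was used in training (membership inference), while the complementary problem, verifying that a dataset was not used, has been little explored. To address this gap, we propose the PRISM test, which detects dataset-level non-membership with only grey-box access to model logits. Our key insight is that two models that have not seen a dataset show higher rank correlation in their normalized token log probabilities than when one of the models has been trained on that dataset. Building on this observation, we construct a correlation-based test to detect non-membership. Experiments show that PRISM reliably rules out membership in the training data for all datasets tested while avoiding false positives, thereby providing a framework for verifying that specific datasets were excluded from LLM training.

Abstract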

As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem – verifying that a dataset was not used – has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership using only grey-box access to model logits. Our key insight is that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities than when one model has been trained on that data. Using this observation, we construct a correlation-based test that detects non-membership. Empirically, PRISM reliably rules out membership in training data across all datasets tested while avoiding false positives, thus offering a framework for verifying that specific datasets were excluded from LLM training.

Keywords: LLM training data, non-membership inference, rank correlation, PRISM, grey-box access, copyright enforcement, compliance auditing, dataset exclusion
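
The statistic at the centre of the abstract is a rank correlation between two models' token log probabilities. The sketch below computes Spearman's rank correlation in plain Python. It covers only that statistic: the normalization step, the choice of reference model, and the decision threshold are not described in the abstract and are left out, and the sample log probabilities are hypothetical.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-token log probabilities from two models on the same text.
model_a = [-2.1, -0.3, -5.7, -1.2, -0.9]
model_b = [-1.8, -0.5, -4.9, -1.5, -0.7]
print(spearman(model_a, model_b))
```

According to the abstract, this correlation is expected to be higher when neither model has seen the dataset than when one of them was trained on it, which is what makes it usable as a non-membership signal.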

139. ❌ Synthetic or Authentic? Building Mental Patient Simulators from Longitudinal Evidence

Authors: Baihan Li, Bingrui Jin, Kunyao Lan, Ming Wang, Mengyue Wu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22704v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

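Scoring rationale: The paper uses large language models (LLMs) to build simulators of patients with mental illness, an application of LLMs to mental health. It explicitly reports experiments on multiple LLM backbones, so it is highly relevant to "Large Language Models" (10 points). The proposed DEPROFILE framework and Chain-of-Change agent constitute an agent system, which is highly relevant to "LLM Agents" (10 points). The mental health application has some connection to "AI for Science" (5 points). Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in nor relevant to the abstract, so they score 0.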

!!! tip deepseek-chat TL;DR

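To address the homogeneous behavior and incoherent disease progression of existing mental-health patient simulation methods, the paper proposes DEPROFILE, a framework grounded in real-world longitudinal data. By building comprehensive patient profiles and introducing a Chain-of-Change agent, it consistently improves dialogue realism, behavioral diversity, and event richness, exceeding existing baselines.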

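Abstract translation

Patient simulation is essential for developing and evaluating mental health dialogue systems. Because most existing approaches rely on snapshot-style prompts with limited information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become key challenges. This work proposes DEPROFILE, a data-grounded patient simulation framework that builds unified, multi-source patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. We further introduce a Chain-of-Change agent that converts noisy longitudinal records into structured, temporally grounded memory representations to support simulation. Experiments on multiple large language model (LLM) backbones show that, with the more comprehensive patient profiles built by DEPROFILE, dialogue realism, behavioral diversity, and event richness improve consistently and exceed state-of-the-art baselines, underscoring the importance of grounding patient simulation in verifiable longitudinal evidence.

Abstract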

Patient simulation is essential for developing and evaluating mental health dialogue systems. As most existing approaches rely on snapshot-style prompts with limited profile information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become key challenges. In this work, we propose DEPROFILE, a data-grounded patient simulation framework that constructs unified, multi-source patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. We further introduce a Chain-of-Change agent to transform noisy longitudinal records into structured, temporally grounded memory representations for simulation. Experiments across multiple large language model (LLM) backbones show that with the more comprehensive profiles constructed by DEPROFILE, dialogue realism, behavioral diversity, and event richness consistently improve and exceed state-of-the-art baselines, highlighting the importance of grounding patient simulation in verifiable longitudinal evidence.

Keywords: patient simulation, mental health dialogue systems, longitudinal evidence, LLM backbones, Chain-of-Change agent, dialogue realism, behavioral diversity, clinical symptoms

140. ❌ Improving LLM Predictions via Inter-Layer Structural Encoders

Authors: Tom Ulanovski, Eyal Blyachman, Maya Bechler-Speicher Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

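Scoring rationale: The paper studies a structural encoding method (ILSE) for the internal layer representations of LLMs, a direct innovation in how LLMs work. It is highly relevant to "Large Language Models" (10 points) because it explicitly studies improving LLM layer representations. It has some connection to "Small Language Models" (5 points) because it reports that small models can rival large ones with this method. It has some connection to "Mechanistic Interpretability" (5 points) because studying internal LLM representations can be seen as interpretability research. Other keywords, including MoE, Scaling Laws, training methods, inference techniques, and application domains, are not covered, so they score 0.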

!!! tip deepseek-chat TL;DR

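The paper proposes Inter-Layer Structural Encoders (ILSE), built on Cayley graphs, which improve prediction by combining an LLM's intermediate-layer representations. ILSE outperforms baselines across multiple tasks and makes small models competitive with substantially larger ones.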

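Abstract translation

The standard practice in large language models (LLMs) is to make predictions from the final-layer token representations. Recent studies, however, show that intermediate layers encode rich information that may contain more task-relevant features than the final-layer representations alone. Importantly, it has been shown that the optimal layer may differ from task to task. In this paper we propose Inter-Layer Structural Encoders (ILSE), a powerful structural approach that learns a single effective representation jointly from all of an LLM's internal layer representations. At the core of ILSE is the Cayley-Encoder, a mathematically grounded geometric encoder that uses expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE on 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, with improvements of up to 44% in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot settings and enables smaller LLMs to compete with models that have far more parameters.

Abstract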

The standard practice in Large Language Models (LLMs) is to base predictions on the final-layer token representations. Recent studies, however, show that intermediate layers encode substantial information, which may contain more task-relevant features than the final-layer representations alone. Importantly, it was shown that for different tasks, different layers may be optimal. In this work we introduce Inter-Layer Structural Encoders (ILSE), a powerful structural approach to learn one effective representation from the LLM’s internal layer representations all together. Central to ILSE is Cayley-Encoder, a mathematically grounded geometric encoder that leverages expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE across 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, achieving up to 44% improvement in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot regimes and can make small LLMs competitive with substantially larger models.

Keywords: Large Language Models, intermediate layers, structural encoder, Cayley graphs, layer representations, classification tasks, semantic similarity, model efficiency
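
The abstract names expander Cayley graphs as the structure behind the Cayley-Encoder but does not say which group or generators are used. As an illustration only, the sketch below builds the Cayley graph of SL(2, Z_n) with the two elementary shear matrices and their inverses as generators, a standard example of an expander family. Whether ILSE uses this construction, and how LLM layers are mapped onto graph nodes, are assumptions left open here.

```python
from collections import deque

# A 2x2 matrix [[a, b], [c, d]] stored as the tuple (a, b, c, d).
Matrix = tuple[int, int, int, int]


def matmul(x: Matrix, y: Matrix, n: int) -> Matrix:
    """Product of two 2x2 matrices with entries reduced modulo n."""
    a, b, c, d = x
    e, f, g, h = y
    return ((a * e + b * g) % n, (a * f + b * h) % n,
            (c * e + d * g) % n, (c * f + d * h) % n)


def cayley_graph_sl2(n: int) -> dict[Matrix, set[Matrix]]:
    """Adjacency sets of the Cayley graph of SL(2, Z_n), found by breadth-first search."""
    # Upper and lower shear matrices together with their inverses modulo n.
    generators = [(1, 1, 0, 1), (1, n - 1, 0, 1), (1, 0, 1, 1), (1, 0, n - 1, 1)]
    identity = (1, 0, 0, 1)
    graph: dict[Matrix, set[Matrix]] = {identity: set()}
    queue = deque([identity])
    while queue:
        node = queue.popleft()
        for gen in generators:
            neighbour = matmul(node, gen, n)
            graph[node].add(neighbour)
            if neighbour not in graph:
                graph[neighbour] = set()
                queue.append(neighbour)
    return graph


graph = cayley_graph_sl2(3)
print(len(graph), {len(neighbours) for neighbours in graph.values()})
```

For n = 3 the group has 24 elements and every node has four neighbours, so the graph stays sparse while remaining well connected, which is the property that makes expanders attractive for message passing.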

141. ❌ Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies

Authors: Siddhant Kulkarni, Yukta Kulkarni Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

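Scoring rationale: The paper studies multi-agent LLM architectures for financial document processing and is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each). Its self-correcting architecture relates to "Self-Correction" (8 points). Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not mentioned in the abstract, so they score 0.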

!!! tip deepseek-chat TL;DR

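The study systematically compares four multi-agent LLM architectures for information extraction from financial documents. The reflexive self-correcting architecture achieves the highest accuracy at higher cost, while the hierarchical architecture occupies the most favorable position on the cost-accuracy Pareto frontier, giving practical guidance for deployment in finance.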

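Abstract translation

The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with little empirical guidance. This paper presents a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker, and reflexive self-correcting loop. We evaluate five frontier and open-weight LLMs on a corpus of 10,000 U.S. Securities and Exchange Commission filings (Forms 10-K, 10-Q, and 8-K). The evaluation covers 25 extraction field types, including governance structures, executive compensation, and financial metrics, and measures five dimensions: field-level F1, document-level accuracy, end-to-end latency, cost per document, and token efficiency. Reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of the sequential baseline, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). Ablation studies on semantic caching, model routing, and adaptive retry strategies further show that hybrid configurations can recover 89% of the reflexive architecture's accuracy gains at only 1.15x baseline cost. A scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings give practitioners actionable guidance for deploying multi-agent LLM systems in regulated financial environments.

Abstract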

The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker and reflexive self-correcting loop. These are evaluated across five frontier and open-weight LLMs on a corpus of 10,000 SEC filings (10-K, 10-Q and 8-K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation and financial metrics, measured along five axes: field-level F1, document-level accuracy, end-to-end latency, cost per document and token efficiency. We find that reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture’s accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi-agent LLM systems in regulated financial environments.

Keywords: multi-agent LLM architectures, financial document processing, orchestration patterns, cost-accuracy tradeoffs, production scaling, information extraction, SEC filings, reflexive self-correcting loop
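
The cost-accuracy Pareto frontier mentioned in the abstract can be computed by discarding every configuration that another one beats on both cost and F1. A minimal sketch follows. Only the reflexive and hierarchical points come from the abstract; the F1 for the sequential baseline and the whole parallel entry are hypothetical placeholders added so the example has something to filter, and the function name is an assumption.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Names of configurations not dominated on (relative cost, F1), ordered by cost.

    A configuration is dominated when another one costs no more and scores
    no lower, and is strictly better on at least one of the two axes.
    """
    frontier = []
    for name, (cost, f1) in points.items():
        dominated = any(
            c <= cost and f >= f1 and (c < cost or f > f1)
            for other, (c, f) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


points = {
    "reflexive": (2.3, 0.943),     # reported in the abstract
    "hierarchical": (1.4, 0.921),  # reported in the abstract
    "sequential": (1.0, 0.90),     # 1.0x baseline cost; F1 is a hypothetical placeholder
    "parallel": (1.6, 0.91),       # hypothetical placeholder, not reported in the abstract
}
print(pareto_frontier(points))
```

Under these placeholder values the parallel configuration drops out because the hierarchical one is both cheaper and more accurate. The reflexive architecture stays on the frontier despite its cost, since nothing else reaches its F1.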

142. ❌ Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages

Authors: Chukwuebuka Anyaegbuna, Eduardo Juan Perez Guerrero, Jerry Liu, Timothy Keyes, April Liang, Natasha Steele, Stephen Ma, Jonathan Chen, Kevin Schulman Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22642v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

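Scoring rationale: The paper evaluates frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) on medical translation, an application of large models to healthcare. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) and to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points), on the grounds that medical translation is an application scenario for bioinformatics and scientific AI. Its validation framework assesses the semantic fidelity of translations, which touches indirectly on factuality and truthfulness ("Hallucination Mitigation OR Factuality OR Truthfulness"), but this is not the core focus, hence 5 points. The other keywords mainly concern model architecture, training methods, inference optimization, and agent systems, whose technical details the paper does not address, so they all receive 0.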

!!! tip deepseek-chat TL;DR

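The study evaluates four frontier large language models on medical translation into high- and low-resource languages and finds that all models preserve semantics to a high degree, with no significant difference in translation quality across resource levels.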

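Abstract translation

Language barriers affect 27.3 million U.S. residents with a non-English language preference, yet professional medical translation remains costly and often unavailable. Using a five-layer validation framework, we evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed that the results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and a lexical borrowing analysis found no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest that frontier LLMs preserve medical meaning across languages at different resource levels, with implications for language access in healthcare.

Abstract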

Language barriers affect 27.3 million U.S. residents with non-English language preference, yet professional medical translation remains costly and often unavailable. We evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories using a five-layer validation framework. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and lexical borrowing analysis showed no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest frontier LLMs preserve medical meaning across resource levels, with implications for language access in healthcare.

Keywords: Large Language Models, Medical Translation, High-Resource Languages, Low-Resource Languages, Semantic Preservation, Validation Framework, Healthcare, Language Access

143. ❌ LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

Authors: Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

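Scoring rationale: The paper studies the adaptation of pretrained language models to low-resource languages and proposes the LGSE framework for vocabulary expansion and embedding initialization, so it is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points). It concerns language models without specifying that they are large language models, so it is rated related to "Large Language Models OR LLMs OR Foundation Models" (8 points). The other keywords (MoE, SLMs, SFT, RAG, reasoning, agents, compression, and so on) are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the loss of morphological information when the vocabulary of a pretrained language model is expanded for low-resource, morphologically rich languages. It proposes LGSE, an embedding initialization framework based on morphological decomposition, which outperforms baseline methods on three NLP tasks in Amharic and Tigrinya.

Abstract (Chinese translation)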

将预训练语言模型适配到低资源且形态丰富的语言仍是一项重大挑战。现有的词汇扩展方法通常依赖于任意分割的子词单元，导致词汇表征碎片化并丢失关键形态信息。为应对这一局限，我们提出了基于词汇的形态感知子词嵌入初始化框架，该框架引入形态学信息驱动的分割方法，用于初始化新词元的嵌入向量。与使用随机向量或任意子词不同，本方法将单词分解为构成语素，并通过平均预训练子词或基于FastText的语素表征来构建语义连贯的嵌入向量。当词元无法分割为有意义语素时，则使用字符n-元表征构建其嵌入以捕捉结构信息。在语言自适应预训练过程中，我们采用正则化项惩罚新引入嵌入向量与初始化值之间的较大偏差，在保持与原始预训练嵌入空间对齐的同时，实现对目标语言的适配。为隔离初始化策略的影响，我们保留原始预训练模型的词汇表和分词器，仅在适配过程中更新新增嵌入向量。我们在两种形态丰富且低资源的语言——阿姆哈拉语和提格里尼亚语（具备形态分割资源）上，针对三项自然语言处理任务评估本方法：问答、命名实体识别和文本分类。实验结果表明，本方法在所有任务中均持续优于基线方法，证明了基于形态学信息的嵌入初始化对提升低资源语言表征质量的有效性。项目资源详见GitHub链接。

Abstract

Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available. Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.

Keywords: pretrained language models, low-resource languages, morphologically rich languages, vocabulary expansion, subword embedding initialization, morphological segmentation, language adaptation, embedding regularization
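The initialization described in the LGSE abstract (average morpheme vectors, fall back to character n-grams, penalize drift from the initial value) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the names `init_embedding`, `char_ngrams`, and `drift_penalty`, the trigram size, the boundary markers, and the squared-L2 form of the penalty are all assumptions, and the segmenter and vector tables are supplied by the caller.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def char_ngrams(token: str, n: int = 3) -> List[str]:
    """Character n-grams of a token padded with boundary markers (assumed scheme)."""
    padded = f"<{token}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def init_embedding(
    token: str,
    segment: Callable[[str], Optional[List[str]]],  # hypothetical morphological segmenter
    morpheme_vecs: Dict[str, Vector],               # pretrained subword or morpheme vectors
    ngram_vecs: Dict[str, Vector],                  # character n-gram vectors
) -> Optional[Vector]:
    """Average morpheme vectors; fall back to character n-grams if none are known."""
    morphemes = segment(token) or []
    known = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    if known:
        return mean_vector(known)
    grams = [ngram_vecs[g] for g in char_ngrams(token) if g in ngram_vecs]
    return mean_vector(grams) if grams else None


def drift_penalty(current: Vector, initial: Vector, strength: float = 0.1) -> float:
    """Assumed regularizer: scaled squared L2 distance from the initial embedding."""
    return strength * sum((c - i) ** 2 for c, i in zip(current, initial))
```

In this sketch a token whose morphemes are all unknown and whose n-grams are all unknown returns `None`, leaving the choice of a last-resort initialization to the caller; the abstract does not say how that case is handled.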

144. ❌ Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Authors: Richard J. Young Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

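Scoring rationale: The paper's core subject is the faithfulness of Chain-of-Thought (CoT) reasoning (15 points). It evaluates 12 large language models, so it is highly relevant to LLMs (10 points). The study bears on models' deliberate reasoning process (System 2 Thinking), explainability (Explainable AI), and factuality and truthfulness (Hallucination Mitigation), all of which are directly related to its subject (10 points each). The paper reports that models internally recognize hint influence but suppress it in their outputs, which is somewhat related to self-reflection (Self-Correction) (5 points). Other keywords such as MoE, SLMs, training methods, acceleration techniques, and AI for Science are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The study evaluates the faithfulness of Chain-of-Thought (CoT) reasoning in 12 open-weight reasoning models. Overall faithfulness rates range from 39.7% to 89.9%, and training methodology and model family predict faithfulness more strongly than parameter count. Acknowledgment of hint influence is far more frequent in thinking tokens than in answer text, which suggests that models recognize the influence internally but often suppress it in their outputs.

Abstract (Chinese translation)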

思维链推理已被提出作为安全关键部署中大语言模型的透明度机制，但其有效性取决于忠实性（即模型是否准确表述实际影响其输出的因素）。先前研究仅针对两种专有模型评估了这一属性，发现Claude 3.7 Sonnet的承认率低至25%，DeepSeek-R1为39%。为将评估扩展到开放权重生态系统，本研究选取来自9个架构家族（参数量7B-685B）的12个开放权重推理模型，在MMLU和GPQA Diamond的498道选择题中注入六类推理提示（迎合性、一致性、视觉模式、元数据、评分器破解及非伦理信息），并测量当提示成功改变答案时，模型在其思维链中承认提示影响的比例。通过41,832次推理测试，各模型家族的总体忠实率从39.7%（Seed-1.6-Flash）到89.9%（DeepSeek-V3.2-Speciale）不等，其中一致性提示（35.5%）和迎合性提示（53.9%）的承认率最低。训练方法和模型家族对忠实性的预测力强于参数规模，基于关键词的分析揭示了思维标记承认率（约87.5%）与答案文本承认率（约28.6%）间的显著差距，表明模型内部能识别提示影响，却系统性地在输出中压制这种承认。这些发现直接影响思维链监控作为安全机制的可行性，并表明忠实性并非推理模型的固有属性，而是随架构、训练方法及影响线索的性质发生系统性变化。

Abstract

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

Keywords: Chain-of-Thought, faithfulness, reasoning models, large language models, transparency, safety mechanism, hint influence, acknowledgment rate
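The abstract measures how often a model acknowledges a hint in its CoT, restricted to the cases where the hint successfully altered the answer. A minimal sketch of that conditional rate is shown below. The `Run` record and the rule used to decide that a hint "altered" an answer (the unhinted answer differs from the hinted option, and the hinted run picks it) are assumptions made for illustration, not the paper's exact protocol.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Run:
    """One question evaluated with and without an injected hint (hypothetical record)."""
    baseline_answer: str  # answer without the hint
    hinted_answer: str    # answer with the hint present
    hint_target: str      # option the hint points to
    acknowledged: bool    # whether the CoT mentions relying on the hint


def hint_altered_answer(run: Run) -> bool:
    """Assumed criterion: the model switched to the hinted option only when hinted."""
    return run.baseline_answer != run.hint_target and run.hinted_answer == run.hint_target


def faithfulness_rate(runs: Iterable[Run]) -> Optional[float]:
    """Share of hint-altered runs whose CoT acknowledges the hint; None if there are none."""
    influenced = [r for r in runs if hint_altered_answer(r)]
    if not influenced:
        return None
    return sum(r.acknowledged for r in influenced) / len(influenced)
```

Conditioning on hint-altered runs matters: a run in which the model ignored the hint, or already chose the hinted option unprompted, says nothing about whether the CoT hides an influence, so it is left out of the denominator.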

145. ❌ CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

Authors: Giovana Kerche Bonás, Roseval Malaquias Junior, Marcos Piau, Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Celio Larcher, Ramon Pires, Rodrigo Nogueira Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

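Scoring rationale: The paper's core is an evaluation of the instruction-following ability of large language models (LLMs) in Brazilian Portuguese, so it is highly relevant to "Large Language Models" (10 points). It explicitly studies instruction following, so it is rated highly relevant to "Instruction Tuning" (10 points). It mentions "frontier reasoning models" and evaluates constraint persistence in multi-turn conversations, which is somewhat related to "Chain of Thought" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, RLHF, and PEFT are neither mentioned in the abstract nor relevant to it, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces CAPITU, a benchmark for evaluating the instruction-following ability of large language models in Brazilian Portuguese. Frontier reasoning models perform strongly (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13).

Abstract (Chinese translation)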

我们推出CAPITU基准测试，用于评估大型语言模型在巴西葡萄牙语中的指令遵循能力。与现有聚焦英语或使用通用提示的基准不同，CAPITU将全部任务置于八部巴西文学经典作品的语境中，将可验证的指令约束与文化根基性内容相结合。该基准包含59种指令类型，分为七个类别，所有设计均无需LLM评判或人工评估即可自动验证。指令类型涵盖葡萄牙语特有的语言约束（如词尾模式-ando/-endo/-indo、-inho/-inha、-mente）和结构性要求。我们在单轮和多轮对话设置下评估了18个前沿模型。结果显示，尖端推理模型表现出色（具备推理功能的GPT-5.2严格准确率达98.5%），而葡萄牙语专用模型展现出有竞争力的成本效益（Sabiazinho-4：87.0%准确率对应0.13美元成本，对比Claude-Haiku-4.5：73.5%准确率对应1.12美元成本）。多轮评估揭示了约束持续性的显著差异，各模型的对话级准确率在60%至96%之间波动。我们识别出形态约束、精确计数以及跨轮次约束持续性衰减等方面的具体挑战。我们公开完整的基准测试集、评估代码和基线结果，以促进葡萄牙语指令遵循能力的研究发展。

Abstract

We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.

Keywords: Large Language Models, instruction-following, benchmark, Brazilian Portuguese, literary context, evaluation, multi-turn, constraint persistence
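The abstract stresses that every CAPITU instruction can be verified automatically, without an LLM judge, and cites word-termination patterns such as -ando/-endo/-indo and -mente. A rule-based check of that kind might look like the sketch below. The constraint form ("use at least N words with a given ending") and the function names are hypothetical; the benchmark's released evaluation code may define its checks differently.

```python
import re
from typing import Iterable, List

# Runs of letters (accented letters included), optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def words_ending_with(text: str, suffixes: Iterable[str]) -> List[str]:
    """Lower-cased words in `text` that end with any of the given suffixes."""
    endings = tuple(s.lower().lstrip("-") for s in suffixes)
    return [w for w in WORD_RE.findall(text.lower()) if w.endswith(endings)]


def meets_minimum(text: str, suffixes: Iterable[str], minimum: int) -> bool:
    """Hypothetical constraint: the response uses at least `minimum` matching words."""
    return len(words_ending_with(text, suffixes)) >= minimum


response = "Ela caminhava lentamente, falando calmamente com o menino."
adverbs = words_ending_with(response, ["-mente"])
gerunds = words_ending_with(response, ["-ando", "-endo", "-indo"])
```

A pure suffix test over-counts: "quando" ends in "ando" without being a gerund, so a production checker would likely need a lexicon or part-of-speech filter on top of this.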

146. ❌ Reddit After Roe: A Computational Analysis of Abortion Narratives and Barriers in the Wake of Dobbs

Authors: Aria Pessianzadeh, Alex H. Poole, Rezvaneh Rezapour Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

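Scoring rationale: The paper is a computational social science study of abortion discourse. It analyzes Reddit posts with NLP methods such as classification and topic modeling. The abstract does not discuss large-model techniques, the technical principles of deep learning, or a specific AI for Science application, so the paper is rated unrelated to all of the technical keywords.

!!! tip deepseek-chat TL;DR

The study computationally analyzes Reddit posts about abortion after the Dobbs decision. It finds that emotional and psychological barriers dominate online abortion narratives, and it links information behaviors, barrier types, expressed emotions, and temporal dynamics.

Abstract (Chinese translation)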

2022年美国最高法院对多布斯诉杰克逊妇女健康组织案的裁决重塑了生殖权利格局，为堕胎服务获取带来了新的不确定性与障碍。本研究对Reddit平台上的堕胎议题讨论进行了大规模计算分析，探讨了在信息寻求与信息分享行为、堕胎的不同阶段（术前、术中、术后）以及2022年多布斯案裁决的三个时期中，获取障碍如何被具体表述。通过采集四个堕胎相关子论坛的超过1.7万条帖子，我们采用多步骤流程对帖子按信息类型、堕胎阶段、障碍类别和情感表达进行分类。依据包含法律、经济、情感及社会障碍等八类障碍的编码手册，我们分析了其与情感及信息行为的关联。基于模型生成的障碍归因的主题建模进一步揭示了讨论话语如何随法律与文化语境变迁而演变。研究结果表明，情感与心理障碍在网络堕胎叙事中持续占据主导地位，紧张、困惑、恐惧和悲伤等情绪在整体讨论中普遍存在。通过联结信息行为、障碍类型、情感表达与时间动态，本研究为理解网络社群中如何应对堕胎议题提供了多维度的阐释。

Abstract

The 2022 U.S. Supreme Court decision in Dobbs v. Jackson Women’s Health Organization reshaped the reproductive rights landscape, introducing new uncertainty and barriers to abortion access. We present a large-scale computational analysis of abortion discourse on Reddit, examining how barriers to access are articulated across information-seeking and information-sharing behaviors, different stages of abortion (before, during, after), and three phases of the Dobbs decision in 2022. Drawing on more than 17,000 posts from four abortion-related subreddits, we employed a multi-step pipeline to classify posts by information type, abortion stage, barrier category, and expressed emotions. Using a codebook of eight barrier types, including legal, financial, emotional, and social obstacles, we analyzed their associations with emotions and information behaviors. Topic modeling of model-generated barrier rationales further revealed how discourse evolved in response to shifting legal and cultural contexts. Our findings show that emotional and psychological barriers consistently dominate abortion narratives online, with emotions such as nervousness, confusion, fear, and sadness prevalent across discourse. By linking information behaviors, barriers, emotions, and temporal dynamics, this study provides a multi-dimensional account of how abortion is navigated in online communities.

Keywords: abortion discourse, computational analysis, Reddit, barriers to access, emotional barriers, Dobbs decision, topic modeling, information behaviors

147. ❌ Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Authors: Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

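Scoring rationale: The paper's core contribution is the Ego2Web benchmark, which bridges egocentric video perception and web agent execution, an application-oriented study of large models. It is highly relevant to "LLM Agents" (10 points) because it explicitly studies multimodal AI agents and web agent execution; related to "Large Language Models" (8 points) because it uses an LLM-as-a-Judge method for automatic evaluation; and somewhat related to "Tool Use" (5 points) because agents must interact with an online environment to complete tasks. Other keywords such as MoE, SFT, and RAG are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces Ego2Web, a benchmark that pairs egocentric video perception with web agent execution to evaluate multimodal AI agents on tasks spanning the user's physical surroundings and the web. It also develops Ego2WebJudge, an LLM-as-a-Judge automatic evaluation method that reaches approximately 84% agreement with human judgment. Experiments show that current agents perform weakly.

Abstract (Chinese translation)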

多模态AI智能体正日益自动化涉及在线网络执行的复杂现实工作流程。然而，当前的网络智能体基准测试存在一个关键局限：它们完全聚焦于基于网络的交互与感知，缺乏对用户真实物理环境的关联。这一局限阻碍了在关键场景下的评估，例如当智能体必须利用第一人称视觉感知（如通过增强现实眼镜）识别用户环境中的物体，随后在线完成相关任务。为弥补这一空白，我们提出了Ego2Web——首个旨在连接第一人称视频感知与网络智能体执行的基准测试。Ego2Web将真实世界的第一人称视频记录与需要视觉理解、网络任务规划及在线环境交互的网络任务相结合，以实现任务成功完成。我们采用自动化数据生成流程，并结合人工验证与优化，构建了涵盖多样化网络任务类型（包括电子商务、媒体检索、知识查询等）的高质量视频-任务配对。为促进基准测试的准确且可扩展的评估，我们还开发了一种新颖的基于大语言模型的自动评估方法Ego2WebJudge，其与人工判断的一致性达到约84%，显著高于现有评估方法。在Ego2Web上对多种先进智能体进行的实验表明，其表现普遍较弱，在所有任务类别中均有大幅提升空间。我们还对任务设计进行了全面的消融研究，强调了所提出任务中精确视频理解的必要性以及当前智能体的局限性。我们希望Ego2Web能成为开发真正强大AI助手的关键新资源，推动构建能够无缝跨越物理与数字世界进行观察、理解与行动的智能系统。

Abstract

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user’s real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user’s surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.

Keywords: Ego2Web, web agent benchmark, egocentric video perception, multimodal AI agents, LLM-as-a-Judge, video-task pairs, visual understanding, web task planning

148. ❌ Rashid: A Cipher-Based Framework for Exploring In-Context Language Learning

Authors: Niyati Bafna, Ryan Soh-Eun Shim, Barbara Plank, David Yarowsky, Hale Sirin Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22497v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

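Scoring rationale: The paper's core is a study of the in-context learning ability of large language models (LLMs), specifically their evaluation on unseen languages simulated by ciphering. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "In-context Learning OR Many-shot Learning" (10 points each). The paper does not address the other keywords, such as model architecture (MoE), training methods (SFT, RLHF), inference optimization, agent systems, or applications in specific scientific domains, so these receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces Rashid, a cipher-based framework that reversibly ciphers high-resource languages to simulate unseen languages, which enables systematic evaluation of large language models on in-context language learning (ICLL). The authors use it to assess current methods, explore the utility of potentially expensive resources for improving ICLL, and test ICLL strategies on downstream tasks beyond machine translation.

Abstract (Chinese translation)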

随着大语言模型在未知语言上下文学习领域日益受到关注，此类语言通常面临缺乏自然语言处理工具、数据资源和研究者专业知识的困境。这意味着研究进展难以评估，该领域难以进行低成本的大规模实验，且上下文学习的发现往往局限于极少数语言和任务。针对这些局限，我们提出了一个研究框架（Rashid），通过可逆密码算法将高资源语言转换为可模拟的未知语言，从而充分利用高资源语言既有的丰富资源，实现对上下文学习现象前所未有的探索。借助该框架，我们使用最先进的评估工具和人工分析评估了该领域的现有方法，探究了潜在高成本资源对提升上下文学习的效用，并在机器翻译之外的丰富下游任务中测试了上下文学习策略。这些探索既展示了本框架所启发的可能性，也为当前上下文学习的性能评估和未来发展方向提供了可操作的见解。

Abstract

Where there is growing interest in in-context language learning (ICLL) for unseen languages with large language models, such languages usually suffer from the lack of NLP tools, data resources, and researcher expertise. This means that progress is difficult to assess, the field does not allow for cheap large-scale experimentation, and findings on ICLL are often limited to very few languages and tasks. In light of such limitations, we introduce a framework (Rashid), for studying ICLL wherein we reversibly cipher high-resource languages (HRLs) to construct truly unseen languages with access to a wide range of resources available for HRLs, unlocking previously impossible exploration of ICLL phenomena. We use our framework to assess current methods in the field with SOTA evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond machine translation. These lines of exploration showcase the possibilities enabled by our framework, as well as providing actionable insights regarding current performance and future directions in ICLL.

Keywords: in-context learning, large language models, unseen languages, cipher framework, evaluation, high-resource languages, downstream tasks
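The key property the abstract relies on is reversibility: text in a high-resource language is ciphered into something a model has not seen, and every existing resource can be ciphered the same way or mapped back. The abstract does not specify the cipher, so the sketch below substitutes a seeded letter-substitution cipher purely to illustrate the round trip. `make_cipher` and `apply_mapping` are hypothetical names, and only ASCII letters are remapped.

```python
import random
import string
from typing import Dict, Tuple


def make_cipher(seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Build a seeded one-to-one letter substitution and its inverse."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    encode = dict(zip(letters, shuffled))
    decode = {cipher: plain for plain, cipher in encode.items()}
    return encode, decode


def apply_mapping(text: str, mapping: Dict[str, str]) -> str:
    """Map each letter through `mapping`, keeping case; other characters pass through."""
    out = []
    for ch in text:
        mapped = mapping.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


encode, decode = make_cipher(seed=0)
source = "The cat sat on the mat."
ciphered = apply_mapping(source, encode)
restored = apply_mapping(ciphered, decode)
```

Because the mapping is a bijection, parallel data, dictionaries, and test sets for the source language can all be ciphered with the same seed, and model outputs can be deciphered and scored with the source language's standard tooling.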

149. ❌ Generating and Evaluating Sustainable Procurement Criteria for the Swiss Public Sector using In-Context Prompting with Large Language Models

Authors: Yingqiang Gao, Veton Matoshi, Luca Rolshoven, Tilia Ellendorff, Judith Binder, Jeremy Austin Jann, Gerold Schneider, Matthias Stürmer Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

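Scoring rationale: The paper's core is the use of LLMs and in-context prompting to generate sustainable procurement criteria for the Swiss public sector, an application of large models in public administration. It is highly relevant to "Large Language Models" (10 points) because LLMs are the core tool; highly relevant to "In-context Learning" (10 points) because it uses in-context prompting; and somewhat related to "AI for Science" (5 points) because it applies AI to public policy and administration, which is not a traditional scientific field. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes an automated pipeline based on large language models and in-context prompting for generating and evaluating sustainable procurement criteria for the Swiss public sector. The pipeline substantially reduces manual drafting effort while producing criteria catalogs that are consistent with official guidelines.

Abstract (Chinese translation)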

公共采购是指政府部门、市政机构及公共资助实体等公共部门机构获取商品与服务的过程。瑞士法律要求将生态、社会和经济可持续性要求以投标人必须满足的准则形式纳入招标评估。然而，将高层级的可持续性法规转化为具体、可验证且针对特定行业的采购准则（如选择标准、授予标准和技术规范），目前仍是一项劳动密集且易出错的手工任务，需要多类商品与服务领域的专业知识以及大量人工投入。本文提出一种可配置的、基于大语言模型（LLM）辅助的流程，该流程以软件形式呈现，旨在支持为瑞士系统化生成和评估面向可持续性的采购准则目录。该系统整合了上下文提示、可互换的LLM后端以及自动化输出验证功能，以实现跨不同采购领域的可审计准则生成。作为概念验证，我们采用瑞士政府和欧盟委员会发布的官方可持续性指南作为结构化参考文件，对流程进行了实例化。我们通过自动化质量检查（包括基于LLM的评估模块）与专家人工编制的黄金标准对比相结合的方式对系统进行评估。结果表明，所提出的流程能显著减少人工起草工作量，同时生成的准则目录与官方指南保持一致。我们进一步探讨了系统局限性、故障模式以及在部署过程中观察到的设计权衡，重点阐述了将生成式人工智能整合到公共部门软件工作流中的关键考量。

Abstract

Public procurement refers to the process by which public sector institutions, such as governments, municipalities, and publicly funded bodies, acquire goods and services. Swiss law requires the integration of ecological, social, and economic sustainability requirements into tender evaluations in the format of criteria that have to be fulfilled by a bidder. However, translating high-level sustainability regulations into concrete, verifiable, and sector-specific procurement criteria (such as selection criteria, award criteria, and technical specifications) remains a labor-intensive and error-prone manual task, requiring substantial domain expertise in several groups of goods and services and considerable manual effort. This paper presents a configurable, LLM-assisted pipeline that is presented as a software supporting the systematic generation and evaluation of sustainability-oriented procurement criteria catalogs for Switzerland. The system integrates in-context prompting, interchangeable LLM backends, and automated output validation to enable auditable criteria generation across different procurement sectors. As a proof of concept, we instantiate the pipeline using official sustainability guidelines published by the Swiss government and the European Commission, which are ingested as structured reference documents. We evaluate the system through a combination of automated quality checks, including an LLM-based evaluation component, and expert comparison against a manually curated gold standard. Our results demonstrate that the proposed pipeline can substantially reduce manual drafting effort while producing criteria catalogs that are consistent with official guidelines. We further discuss system limitations, failure modes, and design trade-offs observed during deployment, highlighting key considerations for integrating generative AI into public sector software workflows.

Keywords: Large Language Models, In-context Prompting, Public Procurement, Sustainability Criteria, Automated Generation, Swiss Public Sector, LLM-assisted Pipeline, Procurement Criteria Catalogs
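The three ingredients named in the abstract (in-context prompting, interchangeable LLM backends, automated output validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: `LLMBackend`, `EchoBackend`, the JSON output shape, and the sample criterion text are all hypothetical; only the three criterion types come from the abstract.

```python
import json
from typing import Protocol


class LLMBackend(Protocol):
    """Any backend exposing complete() can be swapped in (hypothetical interface)."""
    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Hypothetical stand-in backend that returns a fixed JSON answer."""
    def complete(self, prompt: str) -> str:
        return json.dumps(
            [{"type": "award", "text": "Share of recycled material is at least 30%."}]
        )


# Criterion types listed in the abstract.
ALLOWED_TYPES = {"selection", "award", "technical_specification"}


def build_prompt(guideline_excerpt: str, sector: str) -> str:
    # In-context prompting: the reference guideline text goes directly into the prompt.
    return (
        f"Sector: {sector}\nGuideline:\n{guideline_excerpt}\n"
        "Return a JSON list of criteria with fields 'type' and 'text'."
    )


def validate(raw: str) -> list:
    """Automated output validation: parse the JSON and check each criterion's shape."""
    criteria = json.loads(raw)
    for c in criteria:
        if c.get("type") not in ALLOWED_TYPES or not c.get("text", "").strip():
            raise ValueError(f"invalid criterion: {c}")
    return criteria


def generate_criteria(backend: LLMBackend, guideline_excerpt: str, sector: str) -> list:
    return validate(backend.complete(build_prompt(guideline_excerpt, sector)))


catalog = generate_criteria(
    EchoBackend(), "Prefer products with recycled content.", "office furniture"
)
```

Because `generate_criteria` depends only on the `complete` method, replacing `EchoBackend` with a real client would not change the validation path, which is what makes the output auditable in this sketch.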

150. ❌ Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures

Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

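Score rationale: The paper studies hybrid language model architectures (combining attention with state space models or linear attention), which counts as innovation in the technical foundations of large models. Core relevant keywords: 1) 'Large Language Models' (studies language model architectures, 10 points); 2) 'Small Language Models' (studies sub-1B models, 10 points); 3) 'KV Cache Compression OR Linear Attention' (involves linear attention, 8 points); 4) 'Mechanistic Interpretability' (analyzes model mechanisms through functional component ablation, 8 points); 5) 'Quantization OR Model Compression' (provides guidance for model compression, 5 points); 6) 'Speculative Decoding OR Inference Acceleration' (touches on efficiency improvements, 5 points). The remaining keywords are unrelated to the paper (0 points).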
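The arithmetic behind the score line can be sketched in a few lines. The pass mark uses the formula stated in the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula is not stated in the report, so `keyword_score` below assumes score = weight × relevance; that assumption is at least consistent with every row above showing a weight of 0.0 and a score of 0.0.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    # ASSUMPTION: per-keyword score = weight * relevance (not stated in the report).
    return weight * relevance


# (weight, relevance) for the rows of the table above with non-zero relevance.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]

total = sum(keyword_score(w, r) for w, r in rows)
threshold = round(pass_mark(27.0), 1)
passed = total >= threshold
print(total, threshold, passed)
```

Under this assumption a paper can only pass if its keyword weights are non-zero, no matter how high the relevance values are.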

!!! tip deepseek-chat TL;DR

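Using a functional component ablation framework, the paper studies hybrid language model architectures (attention combined with state space models or linear attention) and finds that both component types are essential while also exhibiting functional redundancy, offering guidance for model compression and architecture design.

Abstract (Chinese translation)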

融合注意力机制与状态空间模型（SSM）或线性注意力的混合语言模型虽能提升效率，但其各组件是否被真正利用尚不明确。本研究提出一种功能组件消融框架，应用于两个参数量低于10亿的混合模型——Qwen3.5-0.8B（顺序结构：门控DeltaNet + softmax注意力）和Falcon-H1-0.5B（并行结构：Mamba-2 + 注意力），并以纯Transformer模型Qwen2.5-0.5B作为对照。通过组件组消融、逐层扫描、位置消融、匹配随机对照以及在五个基准测试上的困惑度分析，我们得出四项结论：（1）两类组件均不可或缺，且均未被绕过；（2）替代性组件（线性注意力或SSM）构成语言建模的核心支柱，移除该组件会导致困惑度恶化超过35,000倍，而移除注意力组件仅恶化约82倍；（3）组件重要性呈现位置梯度分布，早期层具有超比例的关键性；（4）混合架构对随机层移除的耐受度比纯Transformer高20-119倍，揭示了组件类型间固有的功能冗余性。这些结果为混合模型的压缩、架构设计及容错部署提供了可操作的指导。

Abstract

Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub-1B hybrid models – Qwen3.5-0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon-H1-0.5B (parallel: Mamba-2 + attention) – with a pure Transformer control (Qwen2.5-0.5B). Through group ablations, layer-wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing >35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20-119x greater resilience to random layer removal than pure Transformers, revealing built-in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault-tolerant deployment.

Keywords: hybrid language models, attention, state space models, linear attention, functional ablation, model compression, architecture design, perplexity analysis
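The degradation figures in the abstract (">35,000x", "~82x") are ratios of perplexity after ablation to perplexity before. A minimal sketch of that bookkeeping, assuming perplexity is the exponential of the mean per-token negative log-likelihood in nats. The loss values are made-up illustrative numbers, not results from the paper, and the model surgery that performs the ablation is omitted.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def degradation_ratio(ablated_nlls, baseline_nlls):
    """How many times worse perplexity becomes after ablating a component."""
    return perplexity(ablated_nlls) / perplexity(baseline_nlls)


baseline = [2.0, 2.5, 3.0]  # hypothetical per-token losses, intact model
ablated = [6.0, 7.0, 8.0]   # hypothetical per-token losses, one component zeroed

ratio = degradation_ratio(ablated, baseline)
print(f"perplexity degradation: {ratio:.1f}x")
```

Since the ratio equals the exponential of the difference in mean loss, a multi-thousand-fold degradation corresponds to a mean-loss increase of only several nats per token.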

151. ❌ Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception

Authors: Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22453v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

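Score rationale: The paper's core contribution is using large vision-language models (LVLMs) to automatically generate Community Notes that counter contextual deception. Highly relevant keywords (10 points): 1) Large Language Models (the paper explicitly uses LVLMs); 2) Retrieval-Augmented Generation (the method is retrieval-augmented); 3) LLM Agents and Multi-agent Systems (the framework is built on multi-agent collaboration). Moderately relevant (5 points): Hallucination Mitigation (the work concerns factuality and deception detection). Other keywords such as MoE, SFT, and quantization are not covered (0 points).

!!! tip deepseek-chat TL;DR

The study proposes ACCNote, a retrieval-augmented multi-agent collaboration framework built on large vision-language models, to automatically generate Community Notes that correct image-based contextual deception, and shows on the new XCheck dataset that it outperforms baseline methods and a commercial tool.

Abstract (Chinese translation)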

社区笔记已成为社交媒体平台上一种有效的众包机制，用于对抗网络欺骗。然而，其依赖人工贡献者的特性限制了其时效性与可扩展性。在本研究中，我们针对基于图像的上下文欺骗，探索自动化的社区笔记生成方法。此类欺骗指将真实图像与误导性上下文（如时间、实体和事件）配对。与先前主要关注欺骗检测（即以二元方式判断帖子真伪）的研究不同，社区笔记式系统需要生成简洁且有依据的笔记，以帮助用户恢复缺失或修正的上下文。该问题尚未得到充分探索，原因有三：(i) 支持研究的数据集稀缺；(ii) 方法必须处理上下文欺骗的动态特性；(iii) 评估困难，因为标准指标无法衡量笔记是否真正提升了用户理解。为填补这些空白，我们构建了一个真实世界数据集 XCheck，其中包含来自 X 平台的帖子及其相关社区笔记和外部上下文。我们进一步提出了自动化上下文校正笔记生成方法，命名为 ACCNote，这是一个基于大型视觉-语言模型构建的检索增强多智能体协作框架。最后，我们引入了一个新的评估指标——上下文帮助性分数（Context Helpfulness Score, CHS），该指标与用户研究结果一致，而非依赖词汇重叠度。在我们构建的 XCheck 数据集上的实验表明，所提出的 ACCNote 在欺骗检测和笔记生成性能上均优于基线模型，并超越了商用工具 GPT5-mini。综上，我们的数据集、方法与指标共同推进了上下文校正笔记的实用化自动生成，朝着构建更负责任的在线社交网络迈进。

Abstract

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation method for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), Community Notes-style systems need to generate concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to three reasons: (i) datasets that support the research are scarce; (ii) methods must handle the dynamic nature of contextual deception; (iii) evaluation is difficult because standard metrics do not capture whether notes actually improve user understanding. To address these gaps, we curate a real-world dataset, XCheck, comprising X posts with associated Community Notes and external contexts. We further propose the Automated Context-Corrective Note generation method, named ACCNote, which is a retrieval-augmented, multi-agent collaboration framework built on large vision-language models. Finally, we introduce a new evaluation metric, Context Helpfulness Score (CHS), that aligns with user study outcomes rather than relying on lexical overlap. Experiments on our XCheck dataset show that the proposed ACCNote improves both deception detection and note generation performance over baselines, and exceeds a commercial tool GPT5-mini. Together, our dataset, method, and metric advance practical automated generation of context-corrective notes toward more responsible online social networks.

Keywords: Large Vision Language Models, Community Notes, Contextual Deception, Retrieval-Augmented Generation, Multi-agent Collaboration, Automated Note Generation, XCheck Dataset, Context Helpfulness Score

152. ❌ LLM-guided headline rewriting for clickability enhancement without clickbait

Authors: Yehudit Aperstein, Linoy Halifa, Sagiv Bar, Alexander Apartsin Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22459v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

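Score rationale: The paper's core is controllable text generation with a large language model (LLM), applied to rewriting news headlines to increase clickability while avoiding clickbait, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Its focus on semantic faithfulness and on avoiding misleading phrasing gives it some relevance to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points). The paper does not address the specific techniques or application areas behind the other keywords, such as MoE, SLMs, training methods, inference acceleration, agent systems, or AI for science, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes an LLM-based guided framework for rewriting news headlines. By combining a clickability guide with a clickbait-suppression guide, it generates headlines that are more engaging but not misleading while preserving semantic faithfulness, balancing higher clickability against clickbait avoidance.

Abstract (Chinese translation)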

在新闻媒体的可控文本生成中，如何在保持信息忠实度的同时提升读者参与度，是一个核心挑战。为提升读者参与度而优化新闻标题的做法，常与“点击诱饵”混为一谈，导致产生夸大或误导性的措辞，损害编辑信任。我们并不将点击诱饵视为一种独立的文体类别，而是将其理解为对原本合法的参与度线索进行不成比例放大后的极端结果。基于这一观点，我们将标题重写构建为一个可控生成问题，即在明确约束语义忠实度与强调比例的前提下，有选择性地强化某些以参与度为导向的语言属性。我们提出了一个基于大语言模型（LLM）的引导式标题重写框架，该框架采用“面向生成的未来判别器”（FUDGE）范式进行推理时控制。大语言模型由两个辅助引导模型进行调控：（1）一个点击诱饵评分模型，提供负向引导以抑制过度的文体放大；（2）一个参与度属性模型，提供与目标点击率目标一致的正向引导。两个引导模型均在从精选的真实世界新闻语料库中提取的中性标题上进行训练。同时，通过在大语言模型受控激活预定义的参与度策略下重写这些原始标题，我们合成了点击诱饵变体。通过在推理时调整引导权重，系统能够生成一个连续谱系上的标题，从中性释义到更具吸引力但仍符合编辑规范的表述。所提出的框架为研究吸引力、语义保持与避免点击诱饵之间的权衡提供了一种原则性方法，并支持在新闻场景下进行负责任的大语言模型标题优化。

Abstract

Enhancing reader engagement while preserving informational fidelity is a central challenge in controllable text generation for news media. Optimizing news headlines for reader engagement is often conflated with clickbait, resulting in exaggerated or misleading phrasing that undermines editorial trust. We frame clickbait not as a separate stylistic category, but as an extreme outcome of disproportionate amplification of otherwise legitimate engagement cues. Based on this view, we formulate headline rewriting as a controllable generation problem, where specific engagement-oriented linguistic attributes are selectively strengthened under explicit constraints on semantic faithfulness and proportional emphasis. We present a guided headline rewriting framework built on a large language model (LLM) that uses the Future Discriminators for Generation (FUDGE) paradigm for inference-time control. The LLM is steered by two auxiliary guide models: (1) a clickbait scoring model that provides negative guidance to suppress excessive stylistic amplification, and (2) an engagement-attribute model that provides positive guidance aligned with target clickability objectives. Both guides are trained on neutral headlines drawn from a curated real-world news corpus. At the same time, clickbait variants are generated synthetically by rewriting these original headlines using an LLM under controlled activation of predefined engagement tactics. By adjusting guidance weights at inference time, the system generates headlines along a continuum from neutral paraphrases to more engaging yet editorially acceptable formulations. The proposed framework provides a principled approach for studying the trade-off between attractiveness, semantic preservation, and clickbait avoidance, and supports responsible LLM-based headline optimization in journalistic settings.

Keywords: large language model, headline rewriting, clickbait avoidance, controllable generation, FUDGE, engagement optimization, semantic faithfulness, journalistic applications
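The inference-time control described in the abstract can be illustrated with a toy next-token reweighting in the spirit of FUDGE, where the language model's log-probabilities are adjusted by attribute classifiers and renormalized. The abstract does not give the exact way the two guides are combined, so the rule below (add `w_pos * log p(engaging)` and `w_neg * log(1 - p(clickbait))`) is an assumption, and the vocabulary and probabilities are invented for illustration.

```python
import math


def guided_distribution(lm_logprobs, p_engaging, p_clickbait, w_pos=1.0, w_neg=1.0):
    """Reweight next-token log-probs with two guide classifiers, then renormalize.

    Guide probabilities must lie strictly between 0 and 1.
    """
    scores = {}
    for tok, lp in lm_logprobs.items():
        # ASSUMPTION: the positive guide adds log p(engaging | prefix + tok);
        # the negative guide adds log(1 - p(clickbait | prefix + tok)).
        scores[tok] = (
            lp
            + w_pos * math.log(p_engaging[tok])
            + w_neg * math.log(1.0 - p_clickbait[tok])
        )
    # Softmax with the log-sum-exp trick for numerical stability.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {tok: math.exp(s - log_z) for tok, s in scores.items()}


# Hypothetical three-token vocabulary and guide outputs.
lm = {"reveals": math.log(0.5), "SHOCKING": math.log(0.3), "reports": math.log(0.2)}
eng = {"reveals": 0.7, "SHOCKING": 0.9, "reports": 0.3}
cb = {"reveals": 0.2, "SHOCKING": 0.95, "reports": 0.05}

neutral = guided_distribution(lm, eng, cb, w_pos=0.0, w_neg=0.0)
guided = guided_distribution(lm, eng, cb, w_pos=1.0, w_neg=1.0)
```

With both weights at zero the language model's own distribution is recovered. Raising them moves probability toward tokens the engagement guide likes and away from tokens the clickbait guide flags, which mirrors the continuum of headlines the abstract describes.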

153. ❌ Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson’s Disease

Authors: Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22225v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

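Score rationale: The paper focuses on cross-lingual dysarthria detection using self-supervised speech representations, an application of AI in biomedicine (AI for Science). It involves domain adaptation, aligning the source-language and target-language distributions through a representation-level language shift. The other keywords mainly concern large language models, reasoning, alignment, optimization, and similar techniques, and have no direct connection to the paper's speech processing and medical diagnosis application.

!!! tip deepseek-chat TL;DR

The paper proposes a representation-level language shift that aligns source-language self-supervised speech representations with the target-language distribution, substantially improving sensitivity and F1 for cross-lingual dysarthria detection in Parkinson's disease.

Abstract (Chinese translation)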

构音障碍语音数据的有限性使得跨语言检测成为一个重要但具有挑战性的问题。一个关键难点在于，语音表征通常编码了语言依赖的结构，这可能干扰构音障碍的检测。我们提出了一种表征层面的语言迁移方法，该方法利用基于健康对照语音估计的质心向量适应，将源语言的自监督语音表征与目标语言的分布对齐。我们在捷克语、德语和西班牙语的帕金森病语音数据集的口部DDK（Diadochokinetic）录音上，于跨语言和多语言两种设置下评估了该方法。语言迁移在跨语言设置中显著提高了检测的敏感性和F1分数，同时在多语言设置中也带来了较小但一致的性能提升。表征分析进一步表明，语言迁移降低了嵌入空间中的语言身份信息，这支持了“语言迁移移除了语言依赖结构”的解释。

Abstract

The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.

Keywords: dysarthria detection, self-supervised speech representations, cross-lingual, Parkinson’s disease, representation-level language shift, domain adaptation, speech analysis, medical AI
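One plausible reading of "centroid-based vector adaptation estimated from healthy-control speech" is a mean shift: compute the centroid of healthy-control embeddings in each language and translate every source-language embedding by the difference. The sketch below implements that reading only; the abstract does not spell out the exact operation, and the two-dimensional vectors are toy values.

```python
def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def language_shift(source_embeddings, source_hc, target_hc):
    """Translate source-language embeddings toward the target-language distribution."""
    # ASSUMPTION: shift = target healthy-control centroid - source healthy-control centroid.
    mu_src, mu_tgt = centroid(source_hc), centroid(target_hc)
    shift = [t - s for s, t in zip(mu_src, mu_tgt)]
    return [[x + d for x, d in zip(vec, shift)] for vec in source_embeddings]


# Toy healthy-control (HC) embeddings per language.
src_hc = [[1.0, 0.0], [3.0, 2.0]]   # centroid (2, 1)
tgt_hc = [[0.0, 4.0], [2.0, 6.0]]   # centroid (1, 5)

# All source-language speakers (patients and controls) receive the same shift.
src_all = [[2.0, 1.0], [5.0, 1.0]]
shifted = language_shift(src_all, src_hc, tgt_hc)
```

Estimating the shift from healthy controls alone means the offset captures language differences rather than disease-related ones, so the separation between patients and controls within the source language is left untouched.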

154. ❌ From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

Authors: Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

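Score rationale: The paper's core topic is workflow optimization for LLM agents, which is highly relevant to 'LLM Agents' and 'Tool Use' (10 points each). It directly touches on agent-workflow techniques such as 'Retrieval-Augmented Generation', 'Chain of Thought', 'System 2 Thinking', 'Self-Correction', and 'Multi-agent Systems' (5 points each). Other keywords, such as model architecture, training methods, inference acceleration, and scientific applications, do not appear in the abstract (0 points).

!!! tip deepseek-chat TL;DR

This survey systematically reviews methods for optimizing LLM agent workflows. It proposes a three-dimensional taxonomy based on when workflow structure is determined, which part is optimized, and which evaluation signals are used, and it distinguishes static templates, dynamic runtime graphs, and execution traces, aiming to give future research a unified framework and a reproducible evaluation standard.

Abstract (Chinese translation)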

基于大语言模型（LLM）的系统正日益普及，其通过构建可执行的工作流来完成任务，这些工作流交织了LLM调用、信息检索、工具使用、代码执行、内存更新与验证。本文综述了近期用于设计与优化此类工作流的方法，我们将这些工作流视为智能体计算图（agentic computation graphs, ACGs）。我们依据工作流结构确定的时间点来组织文献，其中“结构”指的是存在哪些组件或智能体、它们如何相互依赖以及信息如何在它们之间流动。这一视角区分了静态方法与动态方法：静态方法在部署前固定一个可重复使用的工作流框架；动态方法则在执行前或执行期间为特定运行选择、生成或修订工作流。我们进一步从三个维度梳理现有工作：结构何时确定、工作流的哪部分被优化，以及哪些评估信号指导优化（例如任务指标、验证器信号、偏好或基于执行轨迹的反馈）。我们还区分了可重复使用的工作流模板、运行特定的实现图以及执行轨迹，从而将可重复使用的设计选择与在给定运行中实际部署的结构以及实际运行时行为分离开来。最后，我们提出了一种结构感知的评估视角，该视角在下游任务指标之外，补充了图级属性、执行成本、鲁棒性以及跨输入的结构变化等维度。我们的目标是提供一个清晰的术语体系、一个用于定位新方法的统一框架、一个更具可比性的现有文献视图，以及一个为未来LLM智能体工作流优化研究提供更具可复现性的评估标准。

Abstract

Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of existing body of literature, and a more reproducible evaluation standard for future work in workflow optimizations for LLM agents.

Keywords: LLM agents, workflow optimization, agentic computation graphs, static templates, dynamic runtime graphs, tool use, information retrieval, execution traces
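The survey's three-way distinction (reusable template, run-specific realized graph, execution trace) can be made concrete with a small sketch. The class names, the node handlers, and the rule that drops retrieval for short queries are all hypothetical illustrations, not constructs taken from the survey.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class WorkflowTemplate:
    """Reusable scaffold fixed before deployment: node -> set of upstream nodes."""
    deps: dict


@dataclass
class RealizedGraph:
    """Run-specific structure actually deployed for one input."""
    deps: dict
    trace: list = field(default_factory=list)  # execution trace: nodes in run order

    def run(self, handlers, query):
        state = {"query": query}
        # static_order() yields each node after all of its upstream nodes.
        for node in TopologicalSorter(self.deps).static_order():
            state[node] = handlers[node](state)
            self.trace.append(node)
        return state


def realize(template, query):
    """Dynamic step: derive the graph for this run from the template."""
    deps = {n: set(d) for n, d in template.deps.items()}
    # HYPOTHETICAL rule: skip retrieval for very short queries.
    if len(query.split()) < 3:
        deps.pop("retrieve", None)
        deps = {n: d - {"retrieve"} for n, d in deps.items()}
    return RealizedGraph(deps)


template = WorkflowTemplate(
    {"retrieve": set(), "answer": {"retrieve"}, "verify": {"answer"}}
)
handlers = {
    "retrieve": lambda s: f"docs for {s['query']}",
    "answer": lambda s: f"answer using {s.get('retrieve', 'no context')}",
    "verify": lambda s: s["answer"].startswith("answer"),
}

short_run = realize(template, "hi there")
short_run.run(handlers, "hi there")
long_run = realize(template, "what is workflow optimization")
long_run.run(handlers, "what is workflow optimization")
```

A purely static method would always execute `template.deps` unchanged. Here the same template yields two different realized graphs, and each run's `trace` records what actually executed, which is the separation the survey argues evaluation should respect.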

155. ❌ Instruction-Tuned, but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters

Authors: Junyi Zou Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22379v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

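Score rationale: The paper's core topic is capability drift of LoRA adapters under instruction tuning. It directly involves PEFT/LoRA (core content, 15 points), instruction tuning (core content, 10 points), supervised fine-tuning (core content, 10 points), and large language models (underlying technology, 10 points). Other keywords such as MoE, quantization, and inference acceleration are not covered in the paper (0 points).

!!! tip deepseek-chat TL;DR

The study finds that LoRA adapters nominally labeled as instruction-tuned do not always improve verifiable instruction following in cross-task evaluation and may instead perform better on other tasks, revealing a capability drift phenomenon.

Abstract (Chinese translation)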

适配器的选择与部署通常基于其名义标签（如指令微调），这些标签隐式地暗示了适配后能力提升的方向。本研究通过在同一LoRA适配器上跨任务评估，检验名义训练目标是否可靠地与实际跨任务能力增益保持一致。最有力的证据来自IFEval所测量的严格、可自动验证的指令遵循能力：在多个随机种子、基础模型和LoRA配置中，名义标签屡次（但并非总是）无法预测这一可验证目标上的性能提升，且表现出明显的配置敏感性，其中包括接近零增益甚至负增益的案例。在一个受控的“指令与数值”设定中，一个最具代表性的强例证是：一个标称指令微调的适配器将脱靶的基于NM的数值基准性能从0.133显著提升至0.632，却未改善IFEval上的可验证指令遵循能力（ILA：0.313降至0.271；PLA：0.250降至0.143；数值均四舍五入保留三位小数）。我们将这种名义能力与实际能力不匹配的模式描述为“能力漂移”。这种不匹配在原始的跨任务性能矩阵中即可观察到；我们使用的漂移分数仅作为与底层指标单位一致的简明摘要，而非提出新的正式度量指标。来自更广泛指令遵循基准的证据则因基准而异且结果混杂，这反映了不同基准对“指令遵循”操作化定义的异质性；因此，我们并未将跨基准的一致性作为前提假设。总体而言，实际应用启示是：在部署前应进行常规的跨任务评估，并避免将名义标签视为可靠的能力代理指标。

Abstract

Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating the same LoRA adapter across tasks. Our strongest evidence is tied to strict, automatically verifiable instruction following as measured by IFEval: across multiple seeds, base models, and LoRA settings, nominal labels recurrently but not universally fail to predict improvements on this verifiable target, with clear configuration sensitivity including a near-zero or negative case. As an illustrative strongest-case example in a controlled instruction-versus-numeric setting, an instruction-tuned adapter substantially improves off-target NM-based numeric benchmark performance from 0.133 to 0.632 while not improving verifiable instruction following on IFEval (ILA: 0.313 to 0.271; PLA: 0.250 to 0.143; values rounded to three decimals). We refer to this nominal-versus-realized mismatch pattern as capability drift as a descriptive label. The mismatch is visible in the raw cross-task performance matrix; we use a drift score only as a compact summary in the same units as the underlying metrics, not as a new formal metric contribution. Evidence from broader instruction-following benchmarks is benchmark-dependent and mixed, reflecting heterogeneity in how instruction following is operationalized; we therefore do not treat cross-benchmark agreement as a premise. Overall, the practical takeaway is to perform routine cross-task evaluation before deployment and to avoid treating nominal labels as reliable capability proxies.

Keywords: LoRA, adapters, instruction tuning, instruction following, capability drift, cross-task evaluation, IFEval, parameter-efficient fine-tuning
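The mismatch the abstract describes is visible directly in the before/after numbers it quotes. The sketch below tabulates those three pairs and their deltas. The final `drift_summary` is a hypothetical one-number summary (mean off-target gain minus mean on-target gain) added for illustration; it is not the paper's drift score, whose definition is not given in the abstract.

```python
# Scores quoted in the abstract: (base model, with instruction-tuned adapter).
matrix = {
    "numeric (NM-based)": (0.133, 0.632),
    "IFEval ILA": (0.313, 0.271),
    "IFEval PLA": (0.250, 0.143),
}

# Tasks that the "instruction-tuned" label nominally targets.
nominal_targets = {"IFEval ILA", "IFEval PLA"}

# Per-task change, in the same units as the underlying metric.
deltas = {task: round(after - before, 3) for task, (before, after) in matrix.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


on_target = mean(d for t, d in deltas.items() if t in nominal_targets)
off_target = mean(d for t, d in deltas.items() if t not in nominal_targets)

# HYPOTHETICAL summary, not the paper's metric: positive means gains landed off-target.
drift_summary = off_target - on_target

for task, d in deltas.items():
    print(f"{task:20s} {d:+.3f}")
```

Reading the per-task deltas side by side is the routine cross-task check the abstract recommends before deployment; the single summary number is only a convenience.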

156. ❌ OccAny: Generalized Unconstrained Urban 3D Occupancy

Authors: Anh-Quan Cao, Tuan-Hung Vu Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23502v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

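Score rationale: The paper addresses 3D occupancy prediction in computer vision, focusing on geometric reconstruction and segmentation in urban environments using visual geometry foundation models. All scoring keywords relate directly to large language models, the technical principles of deep learning, or AI applications in science, whereas this paper involves no language models, model training techniques, reasoning methods, alignment, compression, agent systems, or domain-specific scientific applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes OccAny, the first urban 3D occupancy prediction model that requires no calibration and generalizes to out-of-domain scenes. Using Segmentation Forcing and Novel View Rendering, it outperforms existing visual geometry baselines across multiple image input settings.

Abstract (Chinese translation)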

现有三维占据栅格预测方法依赖领域内标注数据与精确的传感器标定先验，在可扩展性与跨领域泛化能力方面存在局限。尽管近期视觉几何基础模型展现出强大的泛化性能，但这些模型主要面向通用目标设计，缺乏城市占据栅格预测所需的一个或多个关键要素，即：度量尺度预测、复杂场景中的几何补全能力以及对城市场景的适应性。为填补这一空白，我们提出了OccAny——首个能够处理无约束跨领域未标定场景、预测并补全带有分割特征的度量尺度占据栅格的城市三维占据模型。OccAny具备多模态输入兼容性，可从时序图像、单目图像或环视图像中预测占据栅格。我们的贡献包含三方面：（一）提出了首个通用化三维占据栅格预测框架；（二）设计了分割强制机制，在提升占据栅格质量的同时实现掩码级预测；（三）开发了新颖视角渲染管线，通过推理新视角几何实现测试时视角增强以完成几何补全。大量实验表明，OccAny在三维占据栅格预测任务上超越了所有视觉几何基线模型，同时在两个成熟的城市占据预测数据集上，于三种输入设定下仍与领域内自监督方法保持竞争力。代码已开源：https://github.com/valeoai/OccAny。

Abstract

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .

Keywords: 3D occupancy prediction, urban scenarios, generalization, segmentation features, novel view rendering, visual geometry foundation models, metric prediction, geometry completion

157. ❌ UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
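
The score column and the pass mark can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, not the report generator's actual code: the pass-mark formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) comes from the report's configuration section, while the per-keyword rule `weight × relevance`, capped at 15, and all function names are assumptions.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configuration: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def keyword_score(weight, relevance, max_per_keyword=15.0):
    """ASSUMPTION: score = weight x relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, max_per_keyword)


def total_score(rows):
    """Sum keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# The four rows of the UniGRPO table with non-zero relevance,
# using the weights exactly as the table shows them (0.0).
unigrpo_rows = [(0.0, 5.0), (0.0, 8.0), (0.0, 10.0), (0.0, 10.0)]

threshold = pass_mark(total_weight=27.0)
score = total_score(unigrpo_rows)
passed = score >= threshold
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumed rule, the 0.0 weights shown in the table force every row to 0.0 regardless of relevance. With the weight of 1.0 listed in the report's configuration, the same four rows would sum to 33.0.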

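Scoring rationale: The paper proposes UniGRPO, a unified reinforcement learning framework for interleaved generation, specifically reasoning-driven image generation. Reasoning is central: the paper explicitly addresses "reasoning-driven image generation", in which the model first expands the user prompt through reasoning and then synthesizes the image, so it is highly relevant to "Chain of Thought" and "System 2 Thinking" (core content, 10 points each). The work applies large models to visual generation but does not explicitly name an LLM, so "Large Language Models" receives 5 points (some relevance). The method is a form of "Post-training" (policy optimization with GRPO), which earns 8 points (strong relevance). Other keywords such as MoE, quantization, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

This paper studies how a unified reinforcement learning framework (UniGRPO) can optimize reasoning-driven image generation. Experiments show that the training recipe significantly improves image generation quality through reasoning and provides a scalable baseline for training future fully interleaved models.

Abstract (Chinese translation)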

能够进行交错生成的统一模型已成为一种有前景的范式，学界日益趋向于采用自回归建模处理文本生成，并采用流匹配处理图像生成。为推进这一方向，我们提出了一个专为交错生成设计的统一强化学习框架。我们在其基本单元上验证了该方法：单轮推理驱动的图像生成，即模型首先通过推理扩展用户提示，随后进行图像合成。我们将这一多模态生成过程建模为一个具有稀疏终端奖励的马尔可夫决策过程，并引入UniGRPO，利用GRPO联合优化文本和图像生成策略。为避免过度设计，我们采用极简主义方法，通过无缝整合用于推理的标准GRPO和用于视觉合成的FlowGRPO，充分利用两种模态上成熟的训练方案。为确保其可扩展至多轮交错生成，我们对原始FlowGRPO引入了两项关键修改：（1）消除无分类器引导，以保持线性、无分支的轨迹展开，这对于扩展到涉及多轮交互和多条件生成（例如编辑）的复杂场景至关重要；（2）将标准的潜在KL惩罚替换为直接在速度场上施加的MSE惩罚，这提供了一个更稳健、更直接的正则化信号，以有效缓解奖励破解问题。我们的实验表明，这种统一的训练方案通过推理显著提升了图像生成质量，为未来完全交错模型的训练后优化提供了一个稳健且可扩展的基线。

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
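
As a rough illustration of two ingredients named in the abstract, the sketch below computes group-relative advantages from sparse terminal rewards (the core idea of GRPO) and an MSE penalty between two velocity fields. It is not the authors' implementation: the function names, the use of the population standard deviation, the `eps` constant, and the flat-list representation of a velocity field are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's terminal reward against its sampling group.

    ASSUMPTION: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


def velocity_mse_penalty(policy_velocity, reference_velocity):
    """Mean squared error between the policy's and a reference model's velocities.

    Stands in for the regularizer that replaces the latent KL penalty.
    """
    if len(policy_velocity) != len(reference_velocity):
        raise ValueError("velocity fields must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(policy_velocity, reference_velocity)]
    return sum(squared) / len(squared)


# One prompt, four rollouts (reasoning text followed by an image),
# each scored once at the end of the trajectory.
terminal_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(terminal_rewards)

# Hypothetical velocities predicted by the trained and the frozen reference model.
penalty = velocity_mse_penalty([1.0, 2.0], [1.0, 0.0])
print(advantages, penalty)
```

In a joint setup such as the one the abstract describes, one plausible choice is to apply the same per-rollout advantage to both the reasoning tokens and the image sampling steps of that rollout; whether UniGRPO does exactly this is not stated in the abstract.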

Keywords: Unified Reinforcement Learning, Interleaved Generation, Reasoning-Driven Image Generation, GRPO, FlowGRPO, Markov Decision Process, Multimodal Generation, Policy Optimization

158. ❌ DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Authors: Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

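Scoring rationale: DA-Flow addresses optical flow estimation, a computer vision task, and uses diffusion models to handle video degradation. All scoring keywords concern large language models, the technical principles of deep learning, or AI applications in science, whereas this paper combines a classic computer vision task (optical flow estimation) with diffusion models. It does not involve LLM techniques, training methods, inference optimization, agent systems, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes DA-Flow, which fuses diffusion-model features with convolutional features to estimate optical flow in real-world degraded videos, and substantially outperforms existing methods on multiple benchmarks.

Abstract (Chinese translation)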

在高质量数据上训练的光流模型在面对真实世界中的模糊、噪声和压缩伪影等退化时，性能常会严重下降。为克服这一局限，我们提出了退化感知光流这一新任务，旨在从真实世界退化视频中实现精确的密集对应估计。我们的核心见解是：图像恢复扩散模型的中间表征本身具有退化感知能力，但缺乏时序感知能力。为解决此问题，我们通过全时空注意力机制将模型扩展至跨相邻帧进行关注，并通过实验证明，所得特征展现出零样本对应能力。基于这一发现，我们提出了DA-Flow——一种混合架构，该架构在迭代优化框架内将这些扩散特征与卷积特征相融合。在多个基准测试中，DA-Flow在严重退化条件下显著超越了现有光流方法。

Abstract

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

Keywords: optical flow estimation, diffusion models, degradation-aware, video corruption, spatio-temporal attention, zero-shot correspondence, hybrid architecture, iterative refinement

159. ❌ WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23497v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

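Scoring rationale: The paper focuses on video world models and dynamic world modeling, and is highly relevant to the keyword "World Models AND General World Models" (10 points) because it centers on building action-driven world models. However, it does not involve large language models (LLMs), innovations in deep learning techniques (such as MoE, scaling laws, or PEFT), AI for Science applications, or any other listed keyword, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

This paper introduces WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, collected from an AAA game. It addresses the lack of diverse, semantically meaningful actions and long-horizon state consistency in existing datasets, and its experiments show persistent challenges in modeling rich actions and maintaining state consistency.

Abstract (Chinese translation)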

动力学系统理论与强化学习将世界演化视为由动作驱动的潜在状态动态过程，视觉观测则提供关于状态的部分信息。近期的视频世界模型尝试从数据中学习这种动作条件化的动态。然而，现有数据集很少能满足这一需求：它们通常缺乏多样且具有语义意义的动作空间，且动作往往直接与视觉观测绑定，而非通过底层状态进行中介。因此，动作常与像素级变化纠缠在一起，导致模型难以学习结构化的世界动态，并难以在长时程中保持一致的演化。本文提出WildWorld——一个具有显式状态标注的大规模动作条件化世界建模数据集，该数据集自动采集自一款写实风格的AAA级动作角色扮演游戏（《怪物猎人：荒野》）。WildWorld包含超过1.08亿帧画面，涵盖450余种动作（包括移动、攻击和技能施放），并同步提供逐帧的角色骨骼、世界状态、相机位姿和深度图标注。我们进一步构建WildBench基准，通过动作跟随与状态对齐两项任务评估模型性能。大量实验揭示了当前模型在建模语义丰富的动作以及保持长时程状态一致性方面仍面临持续挑战，凸显了发展状态感知视频生成技术的必要性。项目页面详见 https://shandaai.github.io/wildworld-project/。

Abstract

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

Keywords: world modeling, action-conditioned dynamics, video world models, explicit state annotations, dynamic world modeling, state-aware video generation, long-horizon state consistency, WildWorld dataset

160. ❌ Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Authors: Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

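Scoring rationale: The paper focuses on making diffusion and flow matching models more efficient for image and video generation, in particular by allocating tokens non-uniformly according to the eccentricity-dependent acuity of human vision. Although it applies deep learning to generation tasks, its core content is unrelated to most keywords, especially those concerning large language models. The only relevant keyword is "Post-training OR Supervised Fine-tuning OR SFT", because the paper states that the model is "post-trained from an existing base model"; this is part of the method rather than the core contribution, so it receives 5 points. No other keyword is covered in the paper.

!!! tip deepseek-chat TL;DR

This paper proposes an efficient image and video generation method based on the eccentricity-dependent acuity of human vision. It allocates token density non-uniformly to reduce computational cost, substantially lowering generation time and token count while preserving perceptual quality.

Abstract (Chinese translation)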

扩散模型与流匹配模型已为创意内容创作（如交互式图像与流媒体视频生成）解锁了前所未有的能力。然而，随着对更高分辨率、帧率及上下文长度的需求日益增长，高效生成变得愈发具有挑战性，因为计算复杂度随生成标记数量的增加呈平方级增长。本研究旨在优化用户注视位置已知或可估计（例如通过眼动追踪）场景下的生成效率。在此类场景中，我们利用了人类视觉依赖于离心率的敏锐度特性：虽然用户在注视点周围的小区域（中央凹区域）能感知极高分辨率的视觉信息，但其分辨细节的能力在视野周边区域迅速下降。我们的方法首先通过一个模拟中央凹分辨率的掩码来非均匀分配标记，将更高的标记密度分配给中央凹区域，而降低周边区域的密度。随后在混合分辨率标记设置下生成图像或视频，所得结果在感知上与全分辨率生成难以区分，同时显著减少了标记数量与生成时间。为此，我们开发了一种从高分辨率数据直接构建混合分辨率标记的原理性机制，使得中央凹扩散模型能够基于现有基础模型进行后训练，并保持跨分辨率的内容一致性。我们通过广泛的分析与精心设计的用户研究验证了该方法，证明了中央凹化作为高效生成的一个实用且可扩展维度的有效性。

Abstract

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user’s gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
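
The token savings described above can be illustrated with a toy grid. The sketch below is a simplified stand-in, not the paper's mechanism: it assumes a circular foveal region on a patch grid, keeps full-resolution tokens in any 2×2 block that touches the fovea, and merges every purely peripheral block into one token. The function names, the grid size, the radius, and the block size are all hypothetical.

```python
import math


def foveation_mask(rows, cols, gaze, radius):
    """Mark grid cells whose index lies within `radius` cells of the gaze point."""
    gaze_row, gaze_col = gaze
    return [
        [math.hypot(r - gaze_row, c - gaze_col) <= radius for c in range(cols)]
        for r in range(rows)
    ]


def mixed_resolution_token_count(mask, block=2):
    """Count tokens when peripheral blocks are merged and foveal blocks are not.

    ASSUMPTION: a block containing any foveal cell keeps one token per cell;
    a block with no foveal cell is represented by a single coarse token.
    """
    rows, cols = len(mask), len(mask[0])
    tokens = 0
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [
                mask[r][c]
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            tokens += len(cells) if any(cells) else 1
    return tokens


mask = foveation_mask(rows=8, cols=8, gaze=(3.5, 3.5), radius=1.0)
full_tokens = 8 * 8
mixed_tokens = mixed_resolution_token_count(mask)
# Self-attention cost grows quadratically with the token count.
attention_cost_ratio = (mixed_tokens / full_tokens) ** 2
print(full_tokens, mixed_tokens, round(attention_cost_ratio, 3))
```

On this toy grid the token count drops from 64 to 28, and because attention cost scales quadratically with the number of tokens, the relative attention cost falls to roughly 0.19 of the full-resolution case.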

Keywords: diffusion models, flow matching, foveated generation, efficient generation, token allocation, human vision, video generation, computational efficiency

161. ❌ AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Authors: Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23489v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

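Scoring rationale: The paper proposes AgentRVOS, an agentic pipeline built on an MLLM (multimodal large language model) and SAM3 for zero-shot Referring Video Object Segmentation. Its core innovation is using the MLLM for query-grounded reasoning over object tracks, which relates directly to the keywords "LLM Agents", "Chain of Thought", and "System 2 Thinking", since these concepts capture agent decision-making, multi-step reasoning, and in-depth reasoning. The paper explicitly uses an MLLM, so it is highly relevant to "Large Language Models". Other keywords, such as MoE, SLMs, training techniques, optimization methods, compression and acceleration, and AI-for-science applications, are either not covered or mentioned only as background, so they receive 0.

!!! tip deepseek-chat TL;DR

This work addresses a limitation of training-free methods for Referring Video Object Segmentation: the MLLM must make temporal decisions before object-level evidence is available, which limits reasoning quality and spatio-temporal coverage. It proposes AgentRVOS, an agentic pipeline that combines SAM3's perception with MLLM reasoning and achieves state-of-the-art performance among training-free methods on multiple benchmarks.

Abstract (Chinese translation)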

指代视频目标分割（Referring Video Object Segmentation，RVOS）旨在根据给定的自然语言查询，在整个视频中分割出目标对象。针对该任务的免训练方法通常遵循一个通用流程：多模态大语言模型（MLLM）选择关键帧，在这些帧中定位被指代的对象，随后由视频分割模型传播分割结果。这一设计虽然直观，但要求MLLM在获得任何对象级证据之前就做出时序决策，这限制了其推理质量与时空覆盖范围。为克服此局限，我们提出了AgentRVOS，一种基于SAM3与MLLM互补优势构建的免训练智能体流程。给定从查询中提取的概念，SAM3通过生成的掩码轨迹在全时空范围内提供可靠的感知信息。随后，MLLM基于这些对象级证据进行基于查询的推理，并在SAM3提供的时序存在信息引导下迭代筛选，从而识别目标对象。大量实验表明，AgentRVOS在多个基准测试中取得了免训练方法中最先进的性能，且在不同MLLM骨干网络上均能保持稳定结果。我们的项目页面位于：https://cvlab-kaist.github.io/AgentRVOS/。

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: a MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and a MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

Keywords: Referring Video Object Segmentation, Agentic Pipeline, MLLM, Reasoning, Zero-Shot, Training-Free, SAM3, Object Tracks

162. ❌ One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23488v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

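Scoring rationale: The paper studies monocular novel view generation, which belongs to computer vision and 3D reconstruction, and is unrelated to all scoring keywords (which focus on large models, the technical principles of deep learning, and their applications in science). It does not involve language models, training techniques, reasoning methods, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

This paper proposes OVIE, a novel view generation method that can be trained from single images only. Using monocular depth estimation as a geometric scaffold, it is trained on 30 million unpaired internet images, outperforms prior methods in a zero-shot setting, and runs 600x faster at inference than the second-best baseline.

Abstract (Chinese translation)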

单目新视角合成长期以来需要多视角图像对进行监督，这限制了训练数据的规模与多样性。我们认为这并非必需：单一视角已足够。本文提出OVIE，该方法完全在未配对的互联网图像上进行训练。我们利用单目深度估计器作为训练时的几何支架：将源图像提升至三维空间，应用采样的相机变换，并通过投影获得伪目标视角。为处理遮挡缺失问题，我们引入了一种掩码训练机制，将几何损失、感知损失与纹理损失限制在有效区域内，从而能够在三千万未经筛选的图像上进行训练。在推理阶段，OVIE无需几何先验，既不依赖深度估计器也不依赖三维表示。仅使用真实场景图像训练的OVIE在零样本设置下超越了现有方法，且推理速度比第二优基线快600倍。代码与模型已在https://github.com/AdrienRR/ovie公开。

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
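
The lift, transform, and project step in the abstract follows standard pinhole-camera geometry, which the sketch below works through on a tiny depth map. It is an illustration under stated assumptions rather than the OVIE training code: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-pixel forward splat without z-buffering, and all function names are hypothetical. The resulting boolean mask marks target pixels that received a source pixel; the remaining pixels are the disoccluded holes that a masked loss would ignore.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matvec(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(matrix[i][j] * vector[j] for j in range(3)) for i in range(3)]


def warp_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) to 3D, apply a rigid camera transform, and reproject."""
    point = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    rotated = matvec(rotation, point)
    x, y, z = (rotated[i] + translation[i] for i in range(3))
    if z <= 0:
        return None  # the point ends up behind the target camera
    return (fx * x / z + cx, fy * y / z + cy)


def valid_target_mask(depth_map, fx, fy, cx, cy, rotation, translation):
    """Forward-splat every source pixel and record which target pixels are hit."""
    height, width = len(depth_map), len(depth_map[0])
    mask = [[False] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            hit = warp_pixel(u, v, depth_map[v][u], fx, fy, cx, cy, rotation, translation)
            if hit is None:
                continue
            target_u, target_v = round(hit[0]), round(hit[1])
            if 0 <= target_u < width and 0 <= target_v < height:
                mask[target_v][target_u] = True
    return mask


# A flat 4x4 scene at depth 1 and a sampled transform that shifts content one pixel right.
depth_map = [[1.0] * 4 for _ in range(4)]
mask = valid_target_mask(depth_map, 1.0, 1.0, 0.0, 0.0, IDENTITY, [1.0, 0.0, 0.0])
print(mask[0])
```

With this shift, the leftmost target column receives no source pixel, so it is the region where a pseudo-target view carries no supervision.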

Keywords: Monocular novel-view synthesis, Unpaired internet images, Depth estimator, Masked training, Zero-shot setting, Inference acceleration, 3D reconstruction, Computer vision

163. ❌ TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Authors: Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23487v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

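Scoring rationale: TETO focuses on motion estimation and frame interpolation with event cameras. It proposes a teacher-student framework based on knowledge distillation that uses a pretrained RGB tracker to learn from a small amount of unannotated real-world data. The work covers computer vision, motion estimation, frame interpolation, event cameras, and knowledge distillation, but none of the scoring keywords, which concern large language models (LLMs), innovations in deep learning techniques, or AI for Science. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes TETO, a framework that learns motion estimation from a small amount of unannotated real event-camera data through knowledge distillation, and uses the estimated motion as a prior to condition a video diffusion transformer for frame interpolation, achieving state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)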

事件相机以微秒级分辨率捕获像素级亮度变化，提供RGB帧间丢失的连续运动信息。然而，现有基于事件的运动估计方法依赖于大规模合成数据，通常存在显著的仿真与现实差距。我们提出TETO（基于教师观测的事件追踪），这是一种师生框架，通过从预训练RGB跟踪器进行知识蒸馏，仅需约25分钟未标注的真实世界记录即可学习事件运动估计。我们提出的运动感知数据筛选与查询采样策略，通过将物体运动与主导的自身运动解耦，最大化有限数据的学习效率。所得估计器可联合预测点轨迹与稠密光流，我们将其作为显式运动先验，用于调节预训练视频扩散变换器（video diffusion transformer）以进行帧插值。我们在EVIMO2数据集上实现了最先进的点追踪性能，在DSEC数据集上实现了最优光流估计，且训练数据量减少数个数量级，并证明精确的运动估计可直接转化为在BS-ERGB和HQ-EVFI数据集上更优异的帧插值质量。

Abstract

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

Keywords: event cameras, motion estimation, frame interpolation, knowledge distillation, teacher-student framework, optical flow, video diffusion transformer, data curation

164. ❌ UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Authors: Jiaying Lin, Dan Xu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23478v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

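Scoring rationale: UniFunc3D proposes a unified framework for 3D functionality segmentation. Its core innovation is to treat a multimodal large language model as an active observer that decomposes the task through joint semantic, temporal, and spatial reasoning. This is highly relevant to 'Large Language Models' (10 points), because the paper explicitly uses a multimodal large language model. It also relates to 'LLM Agents' (10 points), because the framework uses the LLM as an active agent to carry out the task. In addition, the reasoning process touches on 'Chain of Thought' and 'System 2 Thinking' (8 points each), because the paper emphasizes joint reasoning, task decomposition, and in-depth reasoning to resolve ambiguity. Other keywords, such as MoE, SLMs, training techniques, optimization methods, and AI-for-science applications, are unrelated to the paper and score 0.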
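The way the table rolls up into the total can be sketched in a few lines. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword formula (weight × relevance) is not stated anywhere in the report and is an assumption made here purely for illustration. Under that assumption, a weight of 0.0 zeroes out even a 10/10 relevance, which matches the totals shown.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as given in the configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    # ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum.
    # The report does not state this formula.
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows) -> float:
    """Sum per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows with non-zero relevance from the UniFunc3D table, as (weight, relevance).
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 8.0), (0.0, 10.0)]

print(round(pass_mark(27.0), 1))  # pass mark for 27 keywords of weight 1.0
print(total_score(rows))          # total under the assumed formula
```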

!!! tip deepseek-chat TL;DR

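The paper addresses visual blindness in functionality segmentation of 3D scenes. By treating a multimodal large language model as an active observer that performs joint spatial-temporal reasoning, it achieves a relative 59.9% mIoU improvement on SceneFun3D without any task-specific training, reaching state-of-the-art performance.

Abstract translation (Chinese)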

三维场景中的功能分割要求智能体将隐式的自然语言指令映射到细粒度交互元素的精确掩码上。现有方法依赖于碎片化的流程，在初始任务解析阶段存在视觉盲区。我们观察到这些方法受限于单尺度、被动且启发式的帧选择策略。本文提出UniFunc3D——一个统一且无需训练的框架，将多模态大语言模型视为主动观察者。通过将语义、时序和空间推理整合至单次前向传播过程，UniFunc3D执行联合推理，将任务分解过程锚定于直接的视觉证据中。我们的方法引入了基于由粗到细策略的主动时空定位机制，使模型能够自适应选择正确的视频帧，聚焦于高细节的交互部件，同时保留消歧所需的全局上下文信息。在SceneFun3D基准测试中，UniFunc3D取得了最先进的性能，以59.9%的相对mIoU提升显著超越无需训练和基于训练的方法，且无需任何任务特定训练。代码将在项目页面发布：https://jiaying.link/unifunc3d。

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

Keywords: 3D functionality segmentation, multimodal large language model, active spatial-temporal grounding, joint reasoning, training-free framework, coarse-to-fine strategy, task decomposition, visual grounding

165. ❌ RealMaster: Lifting Rendered Scenes into Photorealistic Video

Authors: Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23462v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

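Scoring rationale: RealMaster focuses on video generation, using video diffusion models to lift rendered video into photorealistic video while staying aligned with the output of the 3D engine. Its core technique is to train an IC-LoRA (a parameter-efficient fine-tuning method) on a paired video dataset, distilling high-quality outputs into a model that generalizes. The paper is therefore relevant only to the keyword 'PEFT OR LoRA OR Parameter-efficient Fine-tuning': it explicitly trains with IC-LoRA, a LoRA variant, which is a parameter-efficient fine-tuning technique. Its subject, however, is video generation in computer vision and graphics rather than large models or deep learning for science, so no other keyword is relevant.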

!!! tip deepseek-chat TL;DR

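The paper proposes RealMaster, which uses video diffusion models and IC-LoRA to lift video rendered by a 3D engine into photorealistic video, substantially improving realism while preserving geometric and dynamic consistency.

Abstract translation (Chinese)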

当前最先进的视频生成模型能够产生令人瞩目的逼真效果，但其缺乏精确控制能力，难以使生成内容与特定场景需求保持一致。此外，由于缺乏底层的显式几何结构，这些模型无法保证三维一致性。相反，三维引擎能够对每个场景元素进行细粒度控制，并通过设计提供固有的三维一致性，但其输出往往仍陷于“恐怖谷”之中。弥合这种仿真与真实之间的鸿沟，既需要结构上的精确性——即输出必须完全保留输入的几何结构与动态特性，也需要全局语义转换——即材质、光照与纹理必须进行整体性转换以实现逼真效果。我们提出了RealMaster方法，该方法利用视频扩散模型将渲染视频提升为逼真视频，同时保持与三维引擎输出的完全对齐。为训练此模型，我们通过一种基于锚点的传播策略生成配对数据集：首尾帧经增强处理以实现真实感，并借助几何条件线索在中间帧之间进行传播。随后，我们在这些配对视频上训练IC-LoRA模型，将流程中的高质量输出提炼为一个能够超越流程限制的通用模型，从而处理序列中途出现的物体与角色，并实现无需锚点帧的推理。在复杂的GTA-V序列上进行评估后，RealMaster显著优于现有视频编辑基线方法，在提升逼真度的同时，完整保留了原始三维控制所指定的几何结构、动态特性与身份特征。

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the “uncanny valley”. Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline’s constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

Keywords: video generation, photorealistic video, 3D consistency, video diffusion models, IC-LoRA, parameter-efficient fine-tuning, GTA-V sequences, sim-to-real gap

Authors: Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

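Scoring rationale: The paper studies how to improve the in-context learning performance of multimodal large language models (MLLMs) on few-shot object detection, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'In-context Learning OR Many-shot Learning' (10 points each). Other keywords, such as MoE, SFT, RAG, and quantization, are not addressed in the paper and score 0.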

!!! tip deepseek-chat TL;DR

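The paper targets the poor in-context learning performance of multimodal large language models on few-shot object detection. It proposes DetPO, a gradient-free test-time prompt optimization method that improves detection accuracy on several benchmark datasets, by up to 9.7% over prior black-box approaches.

Abstract translation (Chinese)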

多模态大语言模型（MLLMs）在OdinW-13和RefCOCO等主流目标检测基准上展现出强大的视觉定位能力。然而，现有最先进的模型在泛化至预训练中不常见的分布外类别、任务及成像模态时仍面临困难。尽管上下文提示是提升跨任务性能的常用策略，但我们发现其检测准确率往往低于仅使用类别名称的提示方式。这表明当前MLLMs尚无法有效利用少量视觉样本和丰富文本描述进行目标检测。鉴于前沿MLLMs通常仅能通过API访问，且最先进的开源权重模型在消费级硬件上进行微调的成本极高，我们转而探索针对少样本目标检测的黑盒提示优化方法。为此，我们提出检测提示优化（DetPO），这是一种无需梯度的测试时优化方法，通过基于少量视觉训练样本最大化检测准确率并校准预测置信度，对纯文本提示进行优化。我们提出的方法在Roboflow20-VL和LVIS数据集上为通用型MLLMs带来了一致的性能提升，以最高9.7%的优势超越了先前的黑盒优化方法。代码发布于https://github.com/ggare-cmu/DetPO。

Abstract

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
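The general idea of gradient-free, black-box prompt optimization can be illustrated with a greedy search loop. This is a minimal sketch, not the authors' DetPO algorithm: the function names (`optimize_prompt`, `toy_propose`, `toy_score`), the attribute vocabulary, and the scoring rule are all hypothetical. In a real setting, `score` would run the MLLM on the few-shot examples and return detection accuracy; the confidence calibration mentioned in the abstract is not modeled here.

```python
import random
from typing import Callable, Tuple


def optimize_prompt(
    seed_prompt: str,
    propose: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    rounds: int = 5,
    candidates_per_round: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Greedy black-box search: keep a candidate only if it scores higher."""
    rng = random.Random(seed)
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for _ in range(candidates_per_round):
            candidate = propose(best_prompt, rng)
            candidate_score = score(candidate)  # only model outputs are needed, no gradients
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# --- Toy stand-ins (hypothetical) so the sketch runs without any model ---
ATTRIBUTES = ["small", "metallic", "round", "blurry", "red"]
USEFUL = {"metallic", "round"}


def toy_propose(prompt: str, rng: random.Random) -> str:
    """Mutate the prompt by appending one random descriptive word."""
    return prompt + " " + rng.choice(ATTRIBUTES)


def toy_score(prompt: str) -> float:
    """Stand-in for few-shot detection accuracy: reward useful words, penalize length."""
    words = prompt.split()
    return len(USEFUL & set(words)) - 0.1 * len(words)


best, best_value = optimize_prompt("bolt", toy_propose, toy_score)
print(best, best_value)
```

Because a candidate is accepted only when it strictly improves the score, the returned prompt is never worse than the seed on the examples used for scoring. Whether that gain transfers to held-out images is a separate question.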

Keywords: Multi-Modal LLMs, In-Context Learning, Few-Shot Object Detection, Prompt Optimization, Black-box Optimization, Visual Grounding, Detection Accuracy, Test-time Optimization

167. ❌ SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images

Authors: Bao Truong, Quang Nguyen, Baoru Huang, Jinpei Han, Van Nguyen, Ngan Le, Minh-Tan Pham, Doan Huy Hien, Anh Nguyen Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23439v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

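Scoring rationale: The paper focuses on detecting and enhancing gas chimneys in seismic images, an AI application in geophysics. It is unrelated to the vast majority of keywords (LLM, MoE, SFT, RLHF, RAG, reasoning, agents, quantization, and so on), which concern the technical principles, training methods, inference optimization, and agent systems of large language models; the paper involves no language model or related technique. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies deep learning to a scientific field (geophysics). That application is not the core contribution (the emphasis is on dataset creation and benchmarking rather than on the AI technique itself), so the keyword receives 5 points (some relevance).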

!!! tip deepseek-chat TL;DR

To address the lack of labeled data for gas chimney detection in seismic images, the paper creates SIGMA, a physics-based benchmark dataset for gas chimney detection and image enhancement, and validates it as a challenging benchmark.

Abstract translation (Chinese)

地震图像通过现场记录重建地下反射率，为勘探与储层监测提供指导。气烟囱是由地下流体运移引起的垂向异常现象。理解这些现象对于评估油气潜力和规避钻井风险至关重要。然而，由于强烈的地震衰减和散射效应，其精确检测面临挑战。传统基于物理的方法计算成本高昂且对模型误差敏感，而深度学习虽能提供高效替代方案，却缺乏标注数据集。本研究提出SIGMA——一个用于地震图像气烟囱解析的新型物理驱动数据集，其特点包括：（i）用于检测的像素级气烟囱掩码；（ii）用于图像增强的退化图像与真实图像配对数据。我们采用了涵盖多种地质背景和数据采集条件的物理模拟方法。综合实验表明，SIGMA为气烟囱解释提供了具有挑战性的基准测试平台，并有助于推动对地震数据的通用理解。

Abstract

Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce SIGMA, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney mask for detection and (ii) paired degraded and ground-truth image for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.

Keywords: seismic images, gas chimney, physics-based dataset, detection, enhancement, benchmark, deep learning, geological settings

168. ❌ I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation

Authors: Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song, Yifu Wang, Jianfei Cai, Tien-Tsin Wong, Pan Ji Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23413v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

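Scoring rationale: I3DM focuses on scene consistency in video generation and proposes an implicit 3D-aware memory retrieval and injection mechanism. Although it is a deep learning application in computer vision, its content is unrelated to every scoring keyword, all of which center on the technical principles, training methods, inference optimization, alignment, and agent systems of large language models. The paper does not involve language models, MoE, scaling laws, training methods, alignment techniques, inference acceleration, agent systems, or AI-for-science applications.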

!!! tip deepseek-chat TL;DR

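The paper tackles long-term scene consistency when a generated video revisits previously explored areas. It proposes an implicit 3D-aware memory mechanism whose 3D-aware retrieval and injection modules improve revisit consistency and camera control precision.

Abstract translation (Chinese)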

尽管视频生成领域已取得显著进展，但在重新访问先前探索过的区域时保持长期场景一致性仍具挑战性。现有解决方案要么依赖显式构建三维几何（易受误差累积和尺度模糊性问题影响），要么采用简单的相机视场检索方法（通常在复杂遮挡情况下失效）。为克服这些局限，我们提出了I3DM——一种新颖的隐式三维感知记忆机制，用于实现一致性的视频场景生成，该方法无需显式三维重建。我们方法的核心是三维感知记忆检索策略，该策略利用预训练前馈新视角合成模型（FF-NVS）的中间特征来评估视角相关性，即使在高度遮挡场景中也能实现鲁棒检索。此外，为充分利用检索到的历史帧，我们引入了三维对齐记忆注入模块。该模块能够隐式地将历史内容扭曲至目标视角，并根据可靠扭曲区域自适应地调节生成过程，从而提升场景重访一致性并实现精确的相机控制。大量实验表明，我们的方法在场景重访一致性、生成保真度和相机控制精度方面均优于现有先进技术。

Abstract

Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.

Keywords: video generation, scene consistency, 3D-aware memory, implicit 3D reconstruction, memory retrieval, memory injection, novel view synthesis, camera control

169. ❌ GeoSANE: Learning Geospatial Representations from Models, Not Data

Authors: Joelle Hanna, Damian Falk, Stella X. Yu, Damian Borth Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23408v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

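Scoring rationale: GeoSANE proposes learning a unified neural representation from the weights of existing foundation models; its core innovation is weight generation rather than conventional pre-training. It relates to 'Foundation Models' (8 points) because it works with multiple remote sensing foundation models; to 'Pre-training' and 'Fine-tuning' (8 points each) because the generated weights are subsequently fine-tuned; to 'Model Merging' (10 points) because the core method learns a unified representation from the weights of multiple models; and to 'AI for Science' (10 points) because it is applied to geospatial science. Other keywords, such as MoE, SLMs, and RLHF, are unrelated to the paper (0 points).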

!!! tip deepseek-chat TL;DR

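The paper proposes GeoSANE, a framework that generates new model weights by learning a unified representation from the weights of existing foundation models. It addresses the unification of knowledge across multiple geospatial models; the generated models outperform counterparts trained from scratch and match or surpass state-of-the-art remote sensing foundation models across multiple tasks and datasets.

Abstract translation (Chinese)

遥感领域的最新进展催生了大量可用基础模型；这些模型分别基于不同模态、数据集和目标进行训练，却仅捕捉了广阔地理空间知识图谱中的局部信息。尽管这些模型在各自领域表现出色，其能力仍呈现互补性而非统一性。因此，我们并非择一而用，而是致力于将其优势整合为统一的共享表征。本文提出GeoSANE——一个地理空间模型工坊，它能够从现有基础模型与任务专用模型的权重中学习统一的神经表征，并可按需生成新型神经网络权重。针对目标架构，GeoSANE生成的权重可直接用于多模态分类、分割与检测任务的微调。实验表明：GeoSANE生成的模型始终优于从头训练的对照模型，达到或超越当前最先进的遥感基础模型性能；在生成轻量级网络时，其表现也优于通过剪枝或知识蒸馏获得的模型。在十个多样化数据集及GEO-Bench上的评估验证了其强大的泛化能力。通过从预训练范式转向权重生成范式，GeoSANE为跨模型与跨任务的地理空间知识统一与迁移提供了全新框架。代码发布于 https://hsg-aiml.github.io/GeoSANE/ 。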

Abstract

Recent advances in remote sensing have led to an increase in the number of available foundation models; each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural networks weights on-demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at https://hsg-aiml.github.io/GeoSANE/.

Keywords: geospatial representation, foundation models, weight generation, model merging, remote sensing, multi-modal learning, transfer learning, neural representation

170. ❌ Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation

Authors: Xinyu Liu, Zhen Chen, Wuyang Li, Chenxin Li, Yixuan Yuan Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

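Scoring rationale: The paper focuses on 3D medical image segmentation, proposing a lightweight Transformer architecture (Light-UNETR) and a Contextual Synergic Enhancement learning strategy to improve model and data efficiency. Its use of Transformers for medical image analysis places it under AI for Science (bioinformatics / medical image analysis), but it is unrelated to all other keywords, which center on the techniques, training, inference, alignment, and applications of large language models. Accordingly, only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 8 points (the work is applied directly to medical image analysis, a scientific AI application); all other keywords score 0.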

!!! tip deepseek-chat TL;DR

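The paper proposes Light-UNETR, a lightweight Transformer architecture, together with a Contextual Synergic Enhancement learning strategy, to address the high computational cost and heavy labeled-data requirements of Transformers in 3D medical image segmentation, achieving better performance and efficiency on multiple benchmarks.

Abstract translation (Chinese)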

Transformer在三维医学图像分割中展现出卓越性能，但其高计算需求与对大量标注数据的依赖限制了实际应用。为应对这些挑战，我们聚焦于两个关键维度：模型效率与数据效率。具体而言，我们提出轻量化Transformer模型Light-UNETR以实现模型效率。该模型核心包含轻量化维度缩减注意力（Lightweight Dimension Reductive Attention, LIDR）模块，该模块通过多分支注意力机制在降低空间与通道维度的同时捕获全局与局部特征。此外，我们设计了紧凑门控线性单元（Compact Gated Linear Unit, CGLU），以极少的参数实现通道交互的选择性控制。在数据效率方面，我们提出上下文协同增强（Contextual Synergic Enhancement, CSE）学习策略：首先通过注意力引导替换（Attention-Guided Replacement）利用外部上下文信息辅助未标注数据学习，进而采用空间掩码一致性（Spatial Masking Consistency）挖掘内部上下文信息以增强未标注数据的空间上下文推理能力。在多组基准数据集上的实验验证了本方法在性能与效率上的优越性。例如，在左心房分割（Left Atrial Segmentation）数据集上仅使用10%标注数据时，我们的方法以仅9.2%的计算量（FLOPs）和14.2%的参数量，在Jaccard指标上超越BCP方法1.43%。代码已发布于https://github.com/CUHK-AIM-Group/Light-UNETR。

Abstract

Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.
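The efficiency figures can be read either as reductions (as in the abstract) or as the fraction that remains (the Chinese translation above quotes 9.2% of the FLOPs and 14.2% of the parameters); the two are the same result. A short worked restatement follows, together with the standard definition of the Jaccard index. The symbols $P$ (predicted mask) and $G$ (ground-truth mask) are introduced here for illustration and do not appear in the abstract.

```latex
\[
1 - 0.908 = 0.092 \quad (\text{fraction of FLOPs retained}), \qquad
1 - 0.858 = 0.142 \quad (\text{fraction of parameters retained}),
\]
\[
J(P, G) = \frac{|P \cap G|}{|P \cup G|}.
\]
```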

Keywords: 3D medical image segmentation, lightweight transformer, model efficiency, data efficiency, Light-UNETR, Contextual Synergic Enhancement, attention mechanism, semi-supervised learning

171. ❌ From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching

Authors: Feifan Luo, Hongyang Chen Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23383v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

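Scoring rationale: The paper addresses 3D shape matching in computer graphics and vision, proposing a framework called Advanced Functional Maps that optimizes feature extraction and matching through a learnable spectral basis. Its core techniques involve spectral graph theory, functional maps, unsupervised learning, and geometry processing; it does not involve large language models (LLMs), innovations in deep learning principles, or AI applications in science. Every scoring keyword relates to large models, deep learning techniques, or AI for science, whereas this paper studies a traditional computer vision/graphics problem using spectral methods and geometric deep learning, with no direct connection to any topic in the keyword list.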

!!! tip deepseek-chat TL;DR

该论文解决了3D形状匹配中谱基函数优化不足和计算效率低的问题，提出了一个名为Advanced Functional Maps的无监督学习框架，通过可学习的抑制函数优化谱基，实现了端到端的特征提取和基函数联合优化，在非等距和拓扑噪声场景下显著优于现有方法。

Abstract

Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis, a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at https://github.com/LuoFeifan77/Unsupervised-Spectral-Basis-Learning.

Keywords: shape matching, functional maps, spectral basis learning, unsupervised learning, non-rigid 3D shapes, heat diffusion, spectral convolution, geometric deep learning
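
The abstract's closing observation, that rescaling basis functions amounts to spectral convolution with the inhibition functions acting as filters, can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions, not the authors' code: the path-graph Laplacian stands in for a mesh Laplacian, and the exponential `inhibition` filter and all function names are hypothetical.

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian of an n-node path graph (toy stand-in for a mesh Laplacian)."""
    L = np.zeros((n, n))
    for i in range(n - 1):
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    return L

def inhibition(eigvals, tau=0.5):
    """Hypothetical inhibition function: damps high-frequency modes."""
    return np.exp(-tau * eigvals)

n = 8
L = path_laplacian(n)
eigvals, Phi = np.linalg.eigh(L)      # fixed spectral basis: columns of Phi
g = inhibition(eigvals)
f = np.random.default_rng(0).normal(size=n)

# View 1: rescale each basis function by g, then project and reconstruct.
Phi_scaled = Phi * g                  # column k is multiplied by g[k]
out_basis = Phi_scaled @ (Phi.T @ f)

# View 2: spectral convolution with the filter Phi diag(g) Phi^T.
out_conv = Phi @ np.diag(g) @ Phi.T @ f
```

Both views give the same output on this toy example. With the exponential filter chosen here the operation is a heat-diffusion step, which is one reason it was picked for the illustration.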

172. ❌ FG-Portrait: 3D Flow Guided Editable Portrait Animation

Authors: Yating Xu, Yunqi Miao, Evangelos Ververas, Jiankang Deng, Jifei Song Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23381v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

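Scoring rationale: The paper "FG-Portrait: 3D Flow Guided Editable Portrait Animation" addresses portrait animation in computer vision and graphics, proposing a 3D-flow-guided diffusion approach to motion transfer. Its core techniques are 3D head models, geometry-driven motion correspondence, depth-guided sampling, and diffusion models; it does not involve large language models (LLMs), innovations in general deep learning principles, or AI for Science. Every scoring keyword concerns large models, deep learning principles, or scientific AI applications, whereas this paper studies one specific vision task (portrait animation) within conventional computer vision/graphics, so it is unrelated to the keyword list.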

!!! tip deepseek-chat TL;DR

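The paper addresses motion transfer in portrait animation by adding 3D flows and depth-guided sampling to a diffusion model, achieving high-fidelity motion transfer and user-editable control over facial expression and head pose.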

Abstract

Motion transfer from the driving to the source portrait remains a key challenge in portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into the diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation.

Keywords: portrait animation, motion transfer, 3D flow, diffusion model, parametric 3D head models, depth-guided sampling, facial expression editing, head pose editing
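
To make the idea of a learning-free, geometry-driven flow concrete, here is a minimal sketch, assuming a pinhole camera and a head model whose vertices are already posed for the source and driving frames. The function names, focal length, and image center are hypothetical, and the paper's 3D flow encoding and depth-guided sampling are not reproduced.

```python
import numpy as np

def project(points, focal=500.0, center=(128.0, 128.0)):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1)

def backward_flow(src_vertices, drv_vertices):
    """Per-vertex displacement from the driving (target) pixel back to the source pixel."""
    return project(src_vertices) - project(drv_vertices)

# Toy example: two vertices at depth 2, shifted 0.02 units along x in the driving frame.
src = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0]])
drv = src + np.array([0.02, 0.0, 0.0])
flow = backward_flow(src, drv)
```

Because corresponding vertices of a parametric model share indices across frames, the displacement comes from geometry alone and no correspondence network is needed in this toy setting.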

173. ❌ Object Pose Transformer: Unifying Unseen Object Pose Estimation

Authors: Weihang Li, Lorenzo Garattoni, Fabien Despinoy, Nassir Navab, Benjamin Busam Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

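Scoring rationale: The paper addresses 3D object pose estimation in computer vision, proposing a unified Transformer framework for absolute and relative pose estimation of unseen objects. Its content centers on computer vision, geometric reasoning, and deep learning architecture, and does not touch large language models (LLMs), large-model principles, AI alignment, reasoning methods, model optimization, or AI for Science. Every keyword concerns large models, deep learning principles, or scientific AI applications, whereas this is purely a computer vision study, so every keyword receives a relevance of 0.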

!!! tip deepseek-chat TL;DR

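The paper proposes Object Pose Transformer, a unified framework that uses task factorization and contrastive learning to estimate both absolute and relative 3D poses of unseen objects within a single model, reaching state-of-the-art performance on several benchmarks.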

Abstract

Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer, a unified feed-forward framework that bridges these paradigms through task factorization within a single model. Object Pose Transformer jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, Object Pose Transformer is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.

Keywords: object pose estimation, unseen objects, transformer, 3D vision, relative pose, absolute pose, NOCS, camera-agnostic
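
The link between absolute and relative pose that the abstract relies on can be written down directly for rigid SE(3) transforms. The following is a minimal sketch under the assumption that each view has an object-to-camera pose given as a 4x4 homogeneous matrix; the helper names are hypothetical and nothing here reflects the paper's network.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 object-to-camera pose from a rotation about z and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def invert_pose(T):
    """Closed-form inverse of a rigid transform: (R, t) -> (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def relative_pose(T_a, T_b):
    """Transform that maps camera-a coordinates to camera-b coordinates."""
    return T_b @ invert_pose(T_a)

T_a = make_pose(0.3, [0.1, 0.0, 1.0])
T_b = make_pose(-0.5, [0.0, 0.2, 1.5])
T_rel = relative_pose(T_a, T_b)

p_obj = np.array([0.05, -0.02, 0.1, 1.0])   # a point in the object frame
p_a, p_b = T_a @ p_obj, T_b @ p_obj
```

Consistency means `T_rel @ p_a` reproduces `p_b`; a model that predicts both absolute and relative poses can use this relation as a cross-view check.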

174. ❌ ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Authors: Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

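Scoring rationale: The paper proposes ABot-PhysWorld, a 14B Diffusion Transformer that falls under foundation models, so "Large Language Models OR LLMs OR Foundation Models" is highly relevant (10 points). The model uses a DPO-based post-training framework, so "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" and "Post-training OR Supervised Fine-tuning OR SFT" are highly relevant (10 points). The paper studies world models, so "World Models AND General World Models" is highly relevant (10 points). It involves physics alignment and hallucination mitigation, so "Hallucination Mitigation OR Factuality OR Truthfulness" and "Instruction Tuning OR Alignment OR Value Alignment" are partly relevant (5 points). It uses a dataset of three million manipulation clips, which bears on data quality, so "Scaling Laws AND Data Quality" is partly relevant (5 points). It involves robotic manipulation, treated here as an AI for Science application, so "AI for Science OR Bioinformatics OR Cheminformatics" is partly relevant (5 points). It involves robot agent control, so "LLM Agents OR Autonomous Agents OR Agentic Workflow" is partly relevant (5 points). Other keywords such as MoE, SLMs, RAG, CoT, MCTS, and PEFT are not mentioned in the paper and score 0.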
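
The table above lists relevance values up to 10.0/10 yet a total of 0.0. That is consistent with each row's score being the product of its weight and relevance, since every row shows a weight of 0.0. The sketch below assumes that product rule (the report does not state it); the pass mark uses the formula given in the report's configuration section, 5.0 + 0.8 × total weight, and the function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark as stated in the report configuration: base + slope * total weight."""
    return base + slope * total_weight

def total_score(rows):
    """Sum of per-keyword scores, assuming score = weight * relevance (an assumption)."""
    return sum(weight * relevance for weight, relevance in rows)

# A few (weight, relevance) pairs copied from the table above.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 0.0)]
score = total_score(rows)
threshold = pass_mark(27.0)      # 27 keywords with a configured weight of 1.0 each
passed = score >= threshold
```

Under this assumption a zero weight cancels any relevance value, so the paper cannot reach the 26.6 pass mark regardless of how relevant the rationale judges it to be.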

!!! tip deepseek-chat TL;DR

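The paper addresses physically implausible behavior produced by video world models in robotic manipulation. It proposes ABot-PhysWorld, which combines a DPO-based post-training framework with a physics-aware dataset to generate physically plausible, visually realistic videos, and reaches state-of-the-art performance on benchmarks.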

Abstract

Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.

Keywords: World Foundation Model, Robotic Manipulation, Physics Alignment, Diffusion Transformer, DPO-based Post-training, Physical Plausibility, Embodied Video Generation, Zero-shot Benchmark
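
For readers unfamiliar with DPO, the standard pairwise objective is sketched below. This is the generic formulation on log-likelihoods of a preferred and a rejected sample, not the paper's variant: how ABot-PhysWorld adapts DPO to a diffusion model and how its decoupled discriminators enter the loss are not described in the abstract and are not modeled here. The parameter names and `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-likelihoods under the policy being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# The policy equals the reference: no preference is expressed yet.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# The policy raised the preferred sample and lowered the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the preferred (here, physically plausible) sample than the reference does.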

175. ❌ FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures

Authors: Yujie Sun, Zhuoqiang Cai, Chaoyue Niu, Jianchuan Chen, Zhiwen Chen, Chengfei Lv, Fan Wu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23345v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

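Scoring rationale: FHAvatar addresses 3D Gaussian head-avatar reconstruction, a computer vision task involving 3D modeling, multi-view reconstruction, and real-time animation; it does not touch large language models, innovations in general deep learning principles, or AI for Science. Every keyword concerns large models, deep learning principles, or scientific AI applications, whereas this is purely a computer vision/graphics study, so every keyword receives a relevance of 0.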

!!! tip deepseek-chat TL;DR

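The paper proposes FHAvatar, a framework that decouples face and hair representations and uses an aggregated transformer backbone to quickly reconstruct high-quality, composable 3D head avatars from a few casual captures.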

Abstract

We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.

Keywords: 3D Gaussian avatars, face and hair composable, few casual captures, aggregated transformer backbone, real-time animation, hairstyle transfer, digital avatar creation

176. ❌ An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net

Authors: MD Rashidul Islam, Bakary Gibba Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23344v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

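Scoring rationale: The paper addresses medical image segmentation, using an attention-enhanced U-Net for brain tumor segmentation and Grad-CAM for explainability analysis. It is unrelated to the great majority of keywords (large-model techniques, training methods, inference optimization, agents, and so on). It is highly relevant only to "Mechanistic Interpretability OR Explainable AI" (the paper explicitly uses Grad-CAM for explainability) and partly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (a biomedical AI application that does not involve large models).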

!!! tip deepseek-chat TL;DR

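The study proposes an automated brain tumor segmentation framework based on an attention-enhanced U-Net and explainable AI, achieving high-accuracy segmentation on the BraTS 2020 dataset and offering a reliable, interpretable method for clinical use.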

Abstract

Computer-aided segmentation of brain tumors from MRI data is of crucial significance to clinical decision-making in diagnosis, treatment planning, and follow-up disease monitoring. Gliomas, owing to their high malignancy and heterogeneity, represent a very challenging task for accurate and reliable segmentation into intra-tumoral sub-regions. Manual segmentation is typically time-consuming and not reliable, which justifies the need for robust automated techniques. This research resolves this problem by leveraging the BraTS 2020 dataset, where we have labeled MRI scans of glioma patients with four significant classes: background/healthy tissue, necrotic/non-enhancing core, edema, and enhancing tumor. In this work, we present a new segmentation technique based on a U-Net model augmented with executed attention gates to focus on the most significant regions of images. To counter class imbalance, we employ manually designed loss functions like Dice Loss and Categorical Dice Loss, in conjunction with standard categorical cross-entropy. Other evaluation metrics, like sensitivity and specificity, were used to measure discriminability of the model between tumor classes. Besides, we introduce Grad-CAM-based explainable AI to enable visualizing attention regions and improve model interpretability, together with a smooth heatmap generation technique through Gaussian filtering. Our approach achieved superior performance with accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974. This study demonstrates that the use of attention mechanisms, personalized loss functions, and explainable AI significantly improves highly complex tumor structure segmentation precision in MRI scans, providing a reliable and explainable method for clinical applications.

Keywords: Brain Tumor Segmentation, MRI, U-Net, Attention Gates, Explainable AI, Grad-CAM, Medical Imaging, Deep Learning
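
The metrics quoted in the abstract (Dice coefficient, sensitivity, specificity) have standard definitions for binary masks, sketched below on a tiny example. The masks are made up for illustration and the function names are hypothetical; the paper's multi-class averaging and its exact loss definitions are not specified in the abstract and are not reproduced.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for boolean masks."""
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def sensitivity_specificity(pred, target):
    """Return (TP / (TP + FN), TN / (TN + FP)) for boolean masks."""
    tp = np.logical_and(pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return tp / (tp + fn), tn / (tn + fp)

# Toy 2x3 masks: two overlapping pixels, one false positive, one false negative.
pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
target = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)

d = dice(pred, target)
sens, spec = sensitivity_specificity(pred, target)
```

A Dice loss for training is commonly taken as one minus this coefficient computed on soft predictions, which weights small tumor regions more evenly than pixel-wise accuracy does.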

177. ❌ Strain-Parameterized Coupled Dynamics and Dual-Camera Visual Servoing for Aerial Continuum Manipulators

Authors: Niloufar Amiri, Farrokh Janabi-Sharifi Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23333v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

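Scoring rationale: The paper studies dynamic modeling and visual servoing control of tendon-driven aerial continuum manipulators (TD-ACMs), which belongs to robotics, control engineering, and mechanical engineering. It does not involve large language models, deep learning, AI model training, inference optimization, alignment techniques, agent systems, or applications of AI to science. Every keyword concerns large models and deep learning, whereas the paper focuses on physical-system modeling and control algorithms, so every keyword receives a relevance of 0.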

!!! tip deepseek-chat TL;DR

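Coupled dynamic modeling of tendon-driven aerial continuum manipulators (TD-ACMs) is computationally expensive and ignores platform underactuation. The paper proposes a unified Lagrangian ODE framework that combines a strain-parameterized Cosserat rod model with a rigid-body UAV model, then builds on it a robust dual-camera visual servoing controller with field-of-view limitation mitigation and attitude compensation, validated in simulation and experiments.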

Abstract

Tendon-driven aerial continuum manipulators (TD-ACMs) combine the maneuverability of uncrewed aerial vehicles (UAVs) with the compliance of lightweight continuum robots (CRs). Existing coupled dynamic modeling approaches for TD-ACMs incur high computational costs and do not explicitly account for aerial platform underactuation. To address these limitations, this paper presents a generalized dynamic formulation of a coupled TD-ACM with an underactuated base. The proposed approach integrates a strain-parameterized Cosserat rod model with a rigid-body model of the UAV into a unified Lagrangian ordinary differential equation (ODE) framework on $\mathrm{SE}(3)$, thereby eliminating computationally intensive symbolic derivations. Building upon the developed model, a robust dual-camera image-based visual servoing (IBVS) scheme is introduced. The proposed controller mitigates the field-of-view (FoV) limitations of conventional IBVS, compensates for attitude-induced image motion caused by UAV lateral dynamics, and incorporates a low-level adaptive controller to address modeling uncertainties with formal stability guarantees. Extensive simulations and experimental validation on a compact custom-built prototype demonstrate the effectiveness and robustness of the proposed framework in real-world scenarios.

Keywords: aerial continuum manipulators, coupled dynamics, visual servoing, Cosserat rod model, underactuated base, dual-camera, adaptive control, robust control
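
As background for the "conventional IBVS" that the abstract contrasts itself with, the classical single-camera control law for point features is sketched below. It is the textbook formulation, not the paper's dual-camera adaptive scheme; the gain, feature coordinates, and depths are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Classical 2x6 interaction matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_velocity(features, targets, depths, gain=0.5):
    """Camera twist (vx, vy, vz, wx, wy, wz) from v = -gain * pinv(L) @ (s - s*)."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    error = (np.asarray(features, dtype=float)
             - np.asarray(targets, dtype=float)).ravel()
    return -gain * np.linalg.pinv(L) @ error

features = [(0.1, 0.2), (-0.3, 0.1)]   # current normalized image points
targets = [(0.0, 0.0), (0.0, 0.0)]     # desired image points
depths = [1.0, 1.5]                    # assumed point depths in meters
twist = ibvs_velocity(features, targets, depths)
```

The law drives the image error toward zero exponentially as long as the features stay in view, which is the field-of-view limitation the paper's dual-camera design aims to relax.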

178. ❌ ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Authors: Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

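Scoring rationale: The paper proposes ViBe, a pure image adaptation framework that upgrades a pre-trained video diffusion model to ultra-high-resolution video synthesis. Its core contribution is Relay LoRA, a two-stage adaptation strategy built on parameter-efficient fine-tuning (PEFT/LoRA), so that keyword scores 10. The paper involves domain adaptation (from video to image) and fine-tuning, so "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT" each score 5. The remaining keywords mainly concern large language models (LLMs), whereas this paper focuses on video diffusion models and does not cover LLMs, reasoning, agents, or scientific AI, so they score 0.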

!!! tip deepseek-chat TL;DR

本文提出了一种纯图像适应框架ViBe，通过两阶段Relay LoRA策略将预训练的视频扩散模型升级到超高清视频合成，无需视频训练数据，在VBench基准上超越现有方法0.8分。

摘要翻译

基于Transformer的视频扩散模型依赖于对空间与时间令牌的三维注意力机制，这导致计算时间和内存复杂度呈二次方增长，使得超高分辨率视频的端到端训练成本极高。为突破这一瓶颈，我们提出了一种纯图像自适应框架，该框架可将预训练于原生尺度的视频扩散Transformer升级为能够合成更高分辨率视频的模型。然而，单纯使用高分辨率图像进行微调常因图像-视频模态差异而引入显著噪声。为解决此问题，我们将学习目标解耦，分别处理模态对齐与空间外推。我们方法的核心是Relay LoRA——一种两阶段自适应策略。在第一阶段，使用低分辨率图像将视频扩散模型适配至图像域，以弥合模态差异。在第二阶段，进一步使用高分辨率图像对模型进行适配，使其获得空间外推能力。在推理阶段，仅保留高分辨率适配部分，以维持视频生成模态的同时实现高分辨率视频合成。为增强细粒度细节合成能力，我们进一步提出了高频感知训练目标，该目标通过专门设计的重建损失，显式鼓励模型从退化的潜在表示中恢复高频分量。大量实验表明，我们的方法无需任何视频训练数据即可生成具有丰富视觉细节的超高分辨率视频，在VBench基准测试中甚至超越了此前使用高分辨率视频训练的最先进模型0.8分。代码将在https://github.com/WillWu111/ViBe发布。

Abstract

Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.

Keywords: Video Synthesis, Ultra-High-Resolution, Diffusion Transformer, Image Adaptation, LoRA, Parameter-efficient Fine-tuning, Modality Gap, High-Frequency-Awareness
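
The two-stage schedule described in the abstract can be illustrated on a single weight matrix. The following is a minimal sketch, not the authors' implementation: it assumes that the stage-1 update stays applied and frozen while the stage-2 adapter is trained, and that inference adds only the stage-2 update to the base weights. The matrix sizes, values, and helper names (`lora_delta`, `delta_modality`, `delta_highres`) are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_delta(B, A, alpha, rank):
    """Low-rank LoRA update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


# Frozen base weight of the video model (2 x 2 for readability; illustrative values).
W = [[1.0, 0.0], [0.0, 1.0]]

# Stage 1 adapter (low-resolution images): rank-1 factors B (2x1) and A (1x2).
delta_modality = lora_delta([[0.5], [0.0]], [[1.0, 1.0]], alpha=1.0, rank=1)

# Stage 2 adapter (high-resolution images); assumed to be trained on top of
# W + delta_modality, which is how the weights look during stage-2 training.
delta_highres = lora_delta([[0.0], [0.25]], [[2.0, 0.0]], alpha=1.0, rank=1)
W_stage2_training = add(add(W, delta_modality), delta_highres)

# Inference: the stage-1 (image-domain) adapter is dropped, only stage 2 is kept.
W_inference = add(W, delta_highres)
print(W_inference)
```

Dropping `delta_modality` at inference mirrors the abstract's statement that only the high-resolution adaptation is retained, so the base model's video behaviour is not overwritten by the image-domain alignment.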

179. ❌ Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors

Authors: Chuanqing Zhuang, Xin Lu, Zehui Deng, Zhengda Lu, Yiqun Wang, Junqi Diao, Jun Xiao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23324v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

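Scoring rationale: The paper belongs to computer vision and 3D reconstruction. It proposes PFGS360, a pose-free omnidirectional 3D Gaussian Splatting method that reconstructs 3D Gaussians from unposed omnidirectional videos. The work covers camera pose estimation, depth priors, Gaussian optimization, and view synthesis, but it does not touch large language models, the underlying principles of deep learning techniques, or AI applications in science. All keywords relate to large models, deep learning techniques, or AI for Science, whereas this paper is purely 3D computer vision research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes PFGS360, a pose-free omnidirectional 3D Gaussian Splatting method that reconstructs 3D Gaussians from unposed omnidirectional videos using a spherical consistency-aware pose estimation module and a depth-inlier-aware densification module, and it significantly outperforms existing methods on both real-world and synthetic 360-degree videos.

Abstract (Chinese translation)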

基于全景图的360度三维高斯溅射是三维场景表征的关键技术，现有方法通常依赖耗时的运动恢复结构技术来提供相机位姿与稀疏点云先验。本研究提出一种免位姿的360度三维高斯溅射方法PFGS360，能够从无位姿信息的全景视频中重建三维高斯模型。为实现精确的相机位姿估计，我们首先构建了球面一致性感知的位姿估计模块，该模块通过利用高斯模型内部深度先验，在重建的高斯模型与无位姿图像间建立一致的2D-3D对应关系，从而恢复相机位姿。此外，为提升新视角合成的真实感，我们引入了深度内点感知的致密化模块，该模块借助一致的单目深度先验提取深度内点与高斯异常点，实现高效的高斯模型致密化，从而达成照片级真实感的新视角合成。实验表明，本方法在真实世界与合成的360度视频数据上均显著优于现有免位姿及有位姿先验的三维高斯溅射方法。代码发布于https://github.com/zcq15/PFGS360。

Abstract

Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians’ internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. The experiments show significant outperformance over existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at https://github.com/zcq15/PFGS360.

Keywords: Omnidirectional 3D Gaussian Splatting, Pose-Free Reconstruction, 360-Degree Videos, Depth Priors, Novel View Synthesis, Camera Pose Estimation, Gaussian Densification, PFGS360

180. ❌ ARGENT: Adaptive Hierarchical Image-Text Representations

Authors: Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23311v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

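Scoring rationale: The paper studies representation learning for vision-language models (VLMs) in hyperbolic space, focusing on the stability and evaluation of hierarchical embeddings. All of the given keywords target large language models (LLMs) and related techniques (training methods, reasoning, alignment, compression, applications, and so on), whereas the core of this paper is VLMs, a multimodal topic with no direct connection to text-only LLM techniques. The abstract does not mention LLMs, language-model techniques, or any specific AI for Science application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address unstable hierarchical embeddings and unreliable evaluation in existing hyperbolic vision-language models (VLMs), this paper proposes an adaptive entailment loss with a norm regularizer and an angle-based probabilistic entailment protocol (PEP) for evaluation. The resulting model, ARGENT, improves the hyperbolic VLM state of the art by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.

Abstract (Chinese translation)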

大规模视觉-语言模型（VLMs，如CLIP）能够学习强大的语义表征，但其在欧几里得空间中运作，无法捕捉视觉与语言概念固有的层次结构。双曲几何凭借其指数级体积增长特性，为低失真嵌入此类层次结构提供了原理性的替代方案。然而，现有的双曲VLMs所使用的蕴含损失函数具有不稳定性：当父类嵌入向量向原点收缩时，其蕴含锥会向半空间扩展，导致灾难性的锥体塌陷，从而破坏预期的层次结构。此外，对这些模型的层次化评估仍不可靠，主要依赖于基于检索和基于相关性的度量方法，容易受分类体系依赖性和模糊负样本的影响。为解决这些局限，我们提出了一种自适应蕴含损失函数，并辅以范数正则化器，无需启发式的孔径裁剪即可防止锥体塌陷。我们进一步引入了一种基于角度的概率蕴含评估协议（PEP，Probabilistic Entailment Protocol），用于评估层次理解能力，并以AUC-ROC和平均精度（Average Precision）进行评分。本文提出了一个更强大的双曲VLM基线模型ARGENT（自适应层次图像-文本表征，Adaptive hieRarchical imaGe-tExt represeNTation）。ARGENT在图像分类、文本到图像检索以及所提出的层次化度量指标上，分别将双曲VLM的最先进水平提升了0.7、1.1和0.8个绝对百分点。

Abstract

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, being largely retrieval-based and correlation-based metrics and prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline ARGENT, Adaptive hieRarchical imaGe-tExt represeNTation. ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.

Keywords: Vision-Language Models, Hyperbolic Geometry, Hierarchical Representations, Adaptive Entailment Loss, Probabilistic Entailment Protocol, Image Classification, Text-to-Image Retrieval, ARGENT
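
The cone-widening effect mentioned in the abstract can be seen numerically. The sketch below assumes the Poincaré-ball entailment-cone formulation, in which the half-aperture at an apex of norm |x| is arcsin(K(1 - |x|²)/|x|); the abstract does not state which formulation ARGENT uses, and the constant `K` and the clamp to 1 are illustrative choices.

```python
import math


def half_aperture(norm, K=0.1):
    """Half-aperture of a Poincare-ball entailment cone whose apex has radius `norm`.

    Uses psi(x) = arcsin(K * (1 - |x|^2) / |x|). The ratio is clamped to 1 so the
    function stays defined near the origin (the clamp is an illustrative choice).
    """
    if not 0.0 < norm < 1.0:
        raise ValueError("norm must lie strictly inside the unit ball")
    ratio = K * (1.0 - norm ** 2) / norm
    return math.asin(min(1.0, ratio))


# Parent embeddings that sit closer and closer to the origin.
norms = [0.9, 0.5, 0.2, 0.05]
apertures = [half_aperture(n) for n in norms]
for n, a in zip(norms, apertures):
    print(f"|x| = {n:4.2f}  half-aperture = {math.degrees(a):5.1f} deg")
```

Under these assumptions the half-aperture grows as the norm shrinks and saturates at 90 degrees, at which point the cone is a half-space and no longer discriminates between children.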

181. ❌ Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning

Authors: Konstantinos Barmpounakis, Theodoros P. Vagenas, Maria Vakalopoulou, George K. Matsopoulos Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

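Scoring rationale: The paper focuses on medical image synthesis (MRI-to-CT), using Mamba architectures (a type of state-space model) for cross-modality translation, which is an application of AI in biomedicine. Nearly all keywords relate directly to large language models (LLMs) or general large-model techniques, whereas this paper addresses a domain-specific computer vision and medical imaging task and does not involve LLMs, MoE, scaling laws, training techniques, inference optimization, agent systems, and so on. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image analysis is a subfield of bioinformatics/AI for Science; the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 8 points (related, but not the core topic).

!!! tip deepseek-chat TL;DR

This study explores Mamba-based architectures for MRI-to-CT synthesis in support of MRI-only radiotherapy planning. The authors report that their 3D Mamba architecture effectively captures volumetric features and long-range dependencies, enabling accurate CT synthesis with fast inference.

Abstract (Chinese translation)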

针对肿瘤患者的放射治疗工作流程日益依赖于多模态医学影像，通常涉及磁共振成像（MRI）与计算机断层扫描（CT）。仅使用MRI的治疗规划已成为一种具有吸引力的替代方案，因其可减少患者电离辐射暴露，并避免跨模态配准引入的误差。尽管基于nnU-Net的框架目前主导了MRI到CT的合成任务，本研究探索了基于Mamba的架构在此任务中的应用，旨在展示状态空间建模相较于标准卷积神经网络在跨模态转换中的优势。具体而言，我们调整了最初为分割任务设计的U-Mamba和SegMamba架构，以执行跨模态图像生成。我们的三维Mamba架构能有效捕捉复杂的体素特征与长程依赖关系，从而在保持快速推理速度的同时实现精确的CT合成。实验在SynthRAD2025数据集的子集上进行，该数据集包含三个解剖区域的配准单通道MRI-CT三维体数据对。定量评估通过结合以亨斯菲尔德单位（HU）计算的图像相似度指标，以及使用TotalSegmentator工具获得的基于分割的指标共同完成，以确保几何一致性得以保持。这些发现为将状态空间模型整合到放射治疗工作流程中奠定了基础。

Abstract

Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and the SegMamba architecture, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of the SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsfield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.

Keywords: MRI-to-CT synthesis, Mamba, state-space models, radiotherapy planning, cross-modality translation, 3D medical imaging, SynthRAD2025, volumetric features

182. ❌ Knot-10: A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis

Authors: Shiheng Nie, Yunguang Yue Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23286v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

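Scoring rationale: The paper studies a computer vision benchmark for physical knot classification, which is AI research in a specific scientific application setting (physical/visual analysis). It is somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because that keyword covers applications of AI in science and the paper applies AI to visual classification of physical knots. The paper does not involve large models, innovations in the principles of deep learning techniques, LLM-related techniques (such as MoE, SFT, or RAG), model optimization (such as quantization or inference acceleration), or agent systems, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper introduces Knots-10, a fine-grained visual benchmark for physical knot classification with a deployment-oriented split (train on loosely tied knots, test on tightly dressed ones) and a topological difficulty analysis. The best models average 97.2% accuracy on the tight-knot test split, but accuracy drops by 58-69 percentage points in a pilot cross-domain test with 100 phone photographs, exposing rope appearance bias as the dominant failure mode.

Abstract (Chinese translation)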

物理绳结分类是一种细粒度视觉分类场景，其外观线索被刻意抑制：不同类别共享相同的绳索材质、颜色与背景，类别身份主要取决于交叉结构。我们提出Knots-10基准数据集，包含1,440张图像，采用面向实际部署的数据划分方式——在松散系结的绳结图像上训练，在紧密收束的绳结图像上测试。Swin-T与TransFG模型平均准确率达到97.2%；PMG模型得分94.5%，这与“拼图式图像扰乱会破坏交叉连续性”的假设相符。McNemar检验显示五款通用主干网络中有四款性能无统计学差异，因此微小排名差距需谨慎解读。Mantel置换检验表明，在五款模型中的三款里，拓扑距离与混淆模式呈显著相关（p < 0.01）。我们提出TACA正则化方法，将嵌入向量-拓扑对齐度从rho=0.46提升至rho=0.65，但未改善分类准确率；随机距离消融实验产生了可比的对齐效果，表明其收益可能源于通用正则化机制。使用100张手机照片进行的跨域试点测试显示准确率下降58-69个百分点，揭示出绳索外观偏差是主要失效模式。

Abstract

Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.

Keywords: physical knot classification, fine-grained visual classification, benchmark, topological difficulty analysis, deployment-oriented split, embedding-topology alignment, cross-domain test, rope appearance bias
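
The Mantel permutation test named in the abstract compares two distance matrices over the same set of items. The following is a minimal sketch under stated assumptions: it uses Pearson correlation on the strict upper triangles and a one-sided p-value, whereas the paper may use a rank correlation or a different permutation count. The toy matrices `topo` and `confusion_dist` are hypothetical stand-ins for topological distance and a confusion-derived distance.

```python
import itertools
import math
import random


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle of `matrix` after relabelling items by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i, j in itertools.combinations(range(n), 2)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """One-sided Mantel test: is dist_a positively correlated with dist_b?"""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    r_obs = pearson(flat_a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of dist_b together
        if pearson(flat_a, upper_triangle(dist_b, order)) >= r_obs - 1e-12:
            hits += 1
    # The observed arrangement counts as one permutation, hence the +1 terms.
    return r_obs, (hits + 1) / (n_perm + 1)


# Hypothetical data: 6 classes placed on a line, so distances are |x_i - x_j|.
positions = [0, 1, 3, 6, 10, 15]
topo = [[abs(a - b) for b in positions] for a in positions]
confusion_dist = [[2 * d for d in row] for row in topo]  # perfectly aligned toy case
r, p = mantel_test(topo, confusion_dist)
print(f"r = {r:.3f}, p = {p:.3f}")
```

Permuting rows and columns together, rather than shuffling matrix entries independently, is what keeps the null distribution consistent with the dependence among distances that share an item.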

183. ❌ WaveSFNet: A Wavelet-Based Codec and Spatial–Frequency Dual-Domain Gating Network for Spatiotemporal Prediction

Authors: Xinyong Cai, Runming Xie, Hu Chen, Yuankai Wu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23284v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

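Scoring rationale: WaveSFNet focuses on spatiotemporal predictive learning, proposing a wavelet-based codec and a spatial-frequency dual-domain gating network. Although it is an application of deep learning to scientific computing and forecasting, all scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on). The paper covers video prediction in computer vision, wavelet transforms, and convolutional network architectures, and is unrelated to LLM techniques, natural language processing, or the specific LLM techniques named in the keywords (such as MoE, RLHF, RAG, or quantization). Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes WaveSFNet, which combines a wavelet-based codec with a spatial-frequency dual-domain gating network to address the challenge of modeling long-range dynamics while preserving high-frequency details in spatiotemporal prediction, achieving competitive prediction accuracy with low computational complexity on Moving MNIST, TaxiBJ, and WeatherBench.

Abstract (Chinese translation)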

时空预测学习旨在以无监督方式从历史观测数据中预测未来帧，对众多应用至关重要。其核心挑战在于建模长程动态的同时保持高频细节，以实现清晰的多步预测。现有高效的无循环框架通常依赖步进卷积或池化进行下采样，这容易丢失纹理和边界信息，而纯空间算子往往难以平衡局部交互与全局传播。为解决这些问题，我们提出了WaveSFNet——一种将基于小波的编解码器与空间-频率双域门控时空翻译器相统一的高效框架。基于小波的编解码器在下采样与重建过程中保留了高频子带线索。同时，该翻译器首先注入相邻帧差异以显式增强动态信息，随后在大核空间局部建模与频域全局调制之间执行双域门控融合，并结合门控通道交互实现跨通道特征交换。大量实验表明，WaveSFNet在Moving MNIST、TaxiBJ和WeatherBench数据集上取得了具有竞争力的预测精度，同时保持了较低的计算复杂度。代码已开源：https://github.com/fhjdqaq/WaveSFNet。

Abstract

Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial–frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.

Keywords: Spatiotemporal Prediction, Wavelet-based Codec, Spatial-Frequency Dual-Domain, Gated Network, High-frequency Details, Long-range Dynamics, Unsupervised Learning, Computational Efficiency
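
The claim that wavelet downsampling keeps high-frequency cues that strided convolution or pooling discards can be illustrated with a single-level 2D Haar transform. This is a sketch only: the abstract does not name the wavelet family, so Haar is an assumption, and the function names and the toy image are hypothetical.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution subbands: ll (local average) plus three detail
    bands: lh (difference across columns), hl (difference across rows), hh (diagonal).
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(image), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2)
            lh_row.append((a - b + d - e) / 2)
            hl_row.append((a + b - d - e) / 2)
            hh_row.append((a - b - d + e) / 2)
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2, rebuilding the full-resolution image from the four subbands."""
    image = []
    for r in range(len(ll)):
        top, bottom = [], []
        for c in range(len(ll[0])):
            s, h, v, d = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            top += [(s + h + v + d) / 2, (s - h + v - d) / 2]
            bottom += [(s + h - v - d) / 2, (s - h - v + d) / 2]
        image += [top, bottom]
    return image


# Toy frame: a checkerboard block, two flat blocks, and a vertical edge.
frame = [
    [1, 0, 2, 2],
    [0, 1, 2, 2],
    [5, 5, 0, 4],
    [5, 5, 0, 4],
]
ll, lh, hl, hh = haar_dwt2(frame)
restored = haar_idwt2(ll, lh, hl, hh)
strided = [row[::2] for row in frame[::2]]  # stride-2 subsampling for comparison
print(ll, lh, hl, hh, strided, sep="\n")
```

Both `ll` and `strided` are 2 x 2, but only the wavelet path also carries `lh`, `hl`, and `hh`, which is what allows `restored` to match `frame` exactly; the strided version has no record of the checkerboard texture or the edge.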

Authors: Yuchen Wu, Kun Wang, Yining Pan, Na Zhao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23276v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

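Scoring rationale: The paper focuses on domain generalization for multi-modal 3D object detection and proposes the CCF method to address cross-domain performance degradation. All scoring keywords relate to large language models, the principles of deep learning techniques, or AI applications in science, whereas this paper studies multi-modal fusion and domain generalization in computer vision and has no direct connection to those keywords. The paper does not involve large-model techniques, training methods, inference optimization, AI agents, or AI for Science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the performance drop of multi-modal 3D object detection when deployed across domains, this paper proposes a complementary fusion method (CCF) built from three components: Query-Decoupled Loss, LiDAR-Guided Depth Prior, and Complementary Cross-Modal Masking. The method substantially improves domain generalization while preserving source-domain performance.

Abstract (Chinese translation)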

多模态融合已成为实现精确三维目标检测的有效范式。然而，当部署于与训练环境不同的目标域时，其性能会显著下降。本研究聚焦于双分支提议级检测器，识别出限制其跨域泛化鲁棒性的两个因素：1）在雨、夜间等挑战性场景中，某一模态可能出现严重退化；2）激光雷达（LiDAR）分支往往主导检测过程，导致视觉线索被系统性利用不足，且在点云受损时系统表现脆弱。为应对这些挑战，我们提出三个核心组件。首先，查询解耦损失（Query-Decoupled Loss）为纯二维、纯三维及融合查询提供独立监督，重新平衡跨模态的梯度流。其次，激光雷达引导深度先验（LiDAR-Guided Depth Prior）通过图像预测深度与激光雷达深度分布的概率融合，为二维查询注入实例感知的几何先验，从而改善其空间初始化。第三，互补跨模态掩蔽（Complementary Cross-Modal Masking）对图像和点云施加互补的空间掩码，促使两种模态的查询在融合解码器中相互竞争，从而推动自适应融合。大量实验表明，本方法在保持源域性能的同时，相较现有先进基线取得了显著提升。代码与模型已公开于 https://github.com/IMPL-Lab/CCF。

Abstract

Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at https://github.com/IMPL-Lab/CCF.

Keywords: multi-modal fusion, 3D object detection, domain generalization, cross-domain, LiDAR, complementary fusion, query-decoupled loss, depth prior

185. ❌ Multi-Modal Image Fusion via Intervention-Stable Feature Learning

Authors: Xue Wang, Zheng Guan, Wenhua Qian, Chengchao Wang, Runzhuo Ma Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

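Scoring rationale: The paper "Multi-Modal Image Fusion via Intervention-Stable Feature Learning" focuses on multi-modal image fusion in computer vision and proposes a feature learning method based on causal intervention. Although it involves deep learning techniques (such as feature learning and model optimization), its core content has no direct connection to any scoring keyword (all of which center on large language models, the principles of large-model techniques, AI for Science, and so on). The paper does not mention language models, model training techniques, reasoning methods, agent systems, or AI for Science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the tendency of existing multi-modal image fusion methods to capture spurious associations, this paper proposes a causal-intervention-based framework that designs three intervention strategies to identify robust cross-modal dependencies and introduces a Causal Feature Integrator (CFI), achieving state-of-the-art performance on public benchmarks and downstream high-level vision tasks.

Abstract (Chinese translation)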

多模态图像融合将来自不同模态的互补信息整合为统一表征。现有方法主要优化模态间的统计相关性，往往捕捉到由数据集诱导的虚假关联，这些关联在分布变化下性能会下降。本文受因果原理启发，提出一种基于干预的框架以识别稳健的跨模态依赖关系。借鉴珀尔因果层级理论的见解，我们设计了三种原则性干预策略来探究模态关系的不同方面：i) 空间互斥扰动下的互补掩码测试模态是否能够真正补偿彼此缺失的信息；ii) 对相同区域进行随机掩码以识别在部分可观测条件下仍保持信息量的特征子集；iii) 模态丢弃评估每个模态的不可替代贡献。基于这些干预措施，我们引入了因果特征整合器（Causal Feature Integrator, CFI），该模块通过学习识别并优先考虑干预稳定的特征——这些特征通过自适应不变门控机制在不同扰动模式中保持重要性，从而捕捉稳健的模态依赖关系而非虚假关联。大量实验表明，我们的方法在公开基准测试和下游高级视觉任务中均达到了最先进的性能水平。

Abstract

Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl’s causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other’s missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.

Keywords: Multi-modal image fusion, Causal intervention, Intervention-stable features, Causal Feature Integrator, Robust modal dependencies, Cross-modal dependencies, Feature learning, Computer vision
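
The three interventions named in the abstract above (complementary masking, identical random masking, modality dropout) can be illustrated with a minimal mask-generation sketch. This is not the authors' implementation. The function names, the nested-list mask representation (1 = visible, 0 = masked), the `mask_ratio` parameter, and the choice to hide every cell in exactly one modality for the complementary case are all assumptions made for illustration.

```python
import random


def complementary_masks(h, w, mask_ratio=0.5, seed=0):
    """Spatially disjoint masking (assumed form): each cell is hidden in exactly one modality."""
    rng = random.Random(seed)
    mask_a = [[0 if rng.random() < mask_ratio else 1 for _ in range(w)] for _ in range(h)]
    # Modality B sees exactly the cells that are hidden from modality A.
    mask_b = [[1 - v for v in row] for row in mask_a]
    return mask_a, mask_b


def identical_random_masks(h, w, mask_ratio=0.5, seed=0):
    """Random masking of identical regions: both modalities lose the same cells."""
    rng = random.Random(seed)
    mask = [[0 if rng.random() < mask_ratio else 1 for _ in range(w)] for _ in range(h)]
    return mask, [row[:] for row in mask]


def modality_dropout(h, w, drop="a"):
    """Modality dropout: one modality is hidden entirely, the other is left intact."""
    full = [[1] * w for _ in range(h)]
    empty = [[0] * w for _ in range(h)]
    return (empty, full) if drop == "a" else (full, empty)


def apply_mask(image, mask):
    """Zero out the masked cells of a single-channel image given as nested lists."""
    return [[p * m for p, m in zip(img_row, m_row)] for img_row, m_row in zip(image, mask)]
```

In a training loop, each pair of masks would be applied to the two input modalities before fusion, so that feature importance can be compared across the three perturbation patterns.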

186. ❌ GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models

Authors: Zekai Gu, Shuoxuan Feng, Yansong Wang, Hanzhuo Huang, Zhongshuo Du, Chengfeng Zhao, Chengwei Ren, Peng Wang, Yuan Liu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

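Scoring rationale: The paper focuses on a computer vision task that combines 3D reconstruction with video diffusion models, using 3D proxies to guide a video generation model toward high-quality object rendering. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper involves no language models or natural language processing and uses diffusion models only for visual generation. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GO-Renderer, a framework that integrates reconstructed 3D proxies to guide a video diffusion model. It addresses the difficulty of accurately controlling viewpoint and lighting when reconstructing renderable 3D models from images, and achieves high-quality object rendering under arbitrary viewpoints and lighting conditions.

Abstract (Chinese translation)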

从图像重建可渲染的三维模型是一项实用但具有挑战性的任务。近期的前馈式三维重建方法在高效恢复几何结构方面取得了显著成功，但仍无法准确建模这些三维重建模型的复杂外观。当前基于扩散的生成模型能够利用参考图像合成物体的逼真图像或视频，而无需显式建模其外观，这为物体渲染提供了有前景的方向，但缺乏对视角的精确控制。本文提出GO-Renderer，这是一个集成重建三维代理的统一框架，通过引导视频生成模型实现在任意光照条件下、任意视角的高质量物体渲染。我们的方法不仅能够利用重建的三维代理实现精确的视角控制，还能借助扩散生成模型在不同光照环境中实现高质量渲染，而无需显式建模复杂的材质与光照。大量实验表明，GO-Renderer在物体渲染任务中均达到最先进的性能表现，包括在新视角合成图像、在全新光照环境中渲染物体，以及将物体插入现有视频。

Abstract

Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of these 3D reconstructed models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but lacks accurate control over the viewpoints. In this paper, we propose GO-Renderer, a unified framework integrating the reconstructed 3D proxies to guide the video generative models to achieve high-quality object rendering on arbitrary viewpoints under arbitrary lighting conditions. Our method not only enjoys the accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across the object rendering tasks, including synthesizing images on new viewpoints, rendering the objects in a novel lighting environment, and inserting an object into an existing video.

Keywords: 3D reconstruction, video diffusion models, object rendering, viewpoint control, lighting environment, generative models, 3D proxy, state-of-the-art performance

187. ❌ PoseDriver: A Unified Approach to Multi-Category Skeleton Detection for Autonomous Driving

Authors: Yasamin Borhani, Taylor Mordan, Yihan Wang, Reyhaneh Hosseininejad, Javad Khoramdel, Alexandre Alahi Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23215v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

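Scoring rationale: The paper focuses on skeleton detection in computer vision, applied to autonomous driving, and involves multi-task learning, dataset creation, and transfer learning. All scoring keywords relate directly to large language models, the technical principles of deep learning, or AI for Science, whereas this paper studies a conventional computer vision detection problem and does not involve large models, innovations in deep learning principles, or scientific applications. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes PoseDriver, a unified framework for skeleton detection of multiple object categories in autonomous driving scenes. It models each category as a separate task to handle the challenges of multi-task learning, achieves state-of-the-art performance on the OpenLane dataset, and introduces a bicycle skeleton detection dataset to evaluate the framework's transferability.

Abstract (Chinese translation)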

物体骨架提供了一种结构信息的简洁表征，能够捕捉自动驾驶应用中至关重要的姿态与方向等核心特征。然而，仅通过输入图像即可同时处理多实例、多类别的统一架构仍待探索。本文提出PoseDriver，一个专为驾驶场景常见物体设计的、自底向上的多类别骨架检测统一框架。我们将每个类别建模为独立任务，以系统应对多任务学习的挑战。具体而言，我们提出了一种基于骨架表征的车道检测新方法，在OpenLane数据集上取得了领先性能。此外，我们构建了一个用于自行车骨架检测的新数据集，并评估了本框架对新类别的迁移能力。实验结果验证了所提方法的有效性。

Abstract

Object skeletons offer a concise representation of structural information, capturing essential aspects of posture and orientation that are crucial for autonomous driving applications. However, a unified architecture that simultaneously handles multiple instances and categories using only the input image remains elusive. In this paper, we introduce PoseDriver, a unified framework for bottom-up multi-category skeleton detection tailored to common objects in driving scenarios. We model each category as a distinct task to systematically address the challenges of multi-task learning. Specifically, we propose a novel approach for lane detection based on skeleton representations, achieving state-of-the-art performance on the OpenLane dataset. Moreover, we present a new dataset for bicycle skeleton detection and assess the transferability of our framework to novel categories. Experimental results validate the effectiveness of the proposed approach.

Keywords: skeleton detection, autonomous driving, multi-category, multi-task learning, PoseDriver, lane detection, transferability, bottom-up approach

188. ❌ FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation

Authors: Yukinori Yamamoto, Kazuya Nishimura, Tsukasa Fukusato, Hirokazu Nosato, Tetsuya Ogata, Hirokatsu Kataoka Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23199v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

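Scoring rationale: The FDIF paper focuses on 3D medical image segmentation and proposes a formula-driven supervised learning framework based on implicit functions and signed distance functions (SDFs) that generates synthetic training data, avoiding real medical data and expert annotation. The work is unrelated to most keywords (LLMs, MoE, alignment, reasoning, agents, and so on), which mainly target large language models and related techniques. It does touch on "Pre-training OR Continual Pre-training OR Domain Adaptation", because the FDIF framework supports scalable pre-training, although for 3D medical image segmentation models rather than LLMs; this keyword therefore receives 5 points (some relevance). The paper also falls under "AI for Science OR Bioinformatics OR Cheminformatics", because it applies deep learning to biomedical image analysis (3D medical image segmentation); this keyword therefore receives 8 points (highly relevant, but not the core content). No other keyword applies, so the rest are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the FDIF framework, which generates synthetic data from implicit functions and signed distance functions to pre-train 3D medical image segmentation models without real medical data or expert annotation, and reaches performance comparable to self-supervised methods pre-trained on large-scale real data across multiple benchmarks.

Abstract (Chinese translation)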

基于深度学习的3维医学图像分割方法依赖于大规模标注数据集，但由于隐私限制和专家标注的高成本，获取此类数据十分困难。公式驱动监督学习（Formula-Driven Supervised Learning, FDSL）提供了一种具有吸引力的替代方案，可直接从数学公式生成训练数据与标签。然而，现有的基于体素的方法在几何表达能力上存在局限，且无法合成逼真的纹理。我们提出了基于隐式函数的公式驱动监督学习（Formula-Driven supervised learning with Implicit Functions, FDIF），该框架能够在完全不使用真实数据与医学专家标注的情况下实现可扩展的预训练。FDIF引入了基于有符号距离函数（Signed Distance Functions, SDFs）的隐式函数表示，能够对复杂几何结构进行紧凑建模，同时利用SDF的表面表示来支持几何与强度纹理的可控合成。在三个医学图像分割基准数据集（AMOS、ACDC和KiTS）和三种网络架构（SwinUNETR、nnUNet ResEnc-L和nnUNet Primus-M）上的实验表明，FDIF相较于公式驱动方法取得了稳定提升，并且其性能可与基于大规模真实数据集预训练的自监督方法相媲美。我们进一步证明，FDIF预训练同样有益于3维分类任务，这凸显了基于隐式函数的公式监督作为一种无数据表征学习范式的广阔前景。代码发布于https://github.com/yamanoko/FDIF。

Abstract

Deep learning-based 3D medical image segmentation methods rely on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data and medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at https://github.com/yamanoko/FDIF.

Keywords: 3D medical image segmentation, Formula-Driven Supervised Learning, Implicit Functions, Signed Distance Functions, Data-free pre-training, Synthetic data generation, Medical imaging, Representation learning
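
To make the idea of generating both training data and labels from a signed distance function concrete, here is a minimal sketch. It does not reproduce the FDIF code from the linked repository. The sphere primitive, the function names, and the `edge_width` parameter are assumptions chosen for illustration, and the intensity model is a simple linear ramp rather than the textures described in the abstract.

```python
import math


def sphere_sdf(x, y, z, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    cx, cy, cz = center
    return math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2) - radius


def synthesize_volume(size, center, radius, inside=1.0, outside=0.0, edge_width=1.0):
    """Build (intensity, label) volumes, indexed [z][y][x], from a single SDF."""
    intensity, label = [], []
    for z in range(size):
        intensity_slice, label_slice = [], []
        for y in range(size):
            intensity_row, label_row = [], []
            for x in range(size):
                d = sphere_sdf(x, y, z, center, radius)
                # The segmentation label comes for free: negative distance means inside.
                label_row.append(1 if d < 0 else 0)
                # Blend intensity linearly across a band of +/- edge_width around the surface.
                t = min(1.0, max(0.0, 0.5 - d / (2.0 * edge_width)))
                intensity_row.append(outside + (inside - outside) * t)
            intensity_slice.append(intensity_row)
            label_slice.append(label_row)
        intensity.append(intensity_slice)
        label.append(label_slice)
    return intensity, label
```

Because the label and the intensity are both derived from the same distance value, every synthetic volume is annotated exactly, with no manual labeling step.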

189. ❌ PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

Authors: Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

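Scoring rationale: The PhysSkin paper focuses on physics-based animation in computer graphics. It uses neural networks (a transformer encoder and a cross-attention decoder) and self-supervised learning to achieve real-time, generalizable skinning-based deformation, but it does not involve large language models, innovations in deep learning principles, or scientific applications, and is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of making physics-based animation of 3D shapes both real-time and generalizable. It proposes the PhysSkin framework, which combines a neural skinning fields autoencoder with a physics-informed self-supervised learning strategy to deliver high-quality, real-time physics-based animation.

Abstract (Chinese translation)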

实现能够泛化于多样三维形状与离散化方法的实时物理驱动动画，始终是一项根本性挑战。我们提出PhysSkin，一个应对此挑战的物理信息框架。秉承线性混合蒙皮的思想，我们学习连续的蒙皮场作为基函数，将运动子空间坐标提升至全空间变形，其中子空间由操控点变换定义。为生成无网格、与离散化方式无关、物理一致且能良好泛化于不同三维形状的蒙皮场，PhysSkin采用了一种新的神经蒙皮场自编码器，其包含一个基于Transformer的编码器和一个交叉注意力解码器。此外，我们还开发了一种新颖的物理信息自监督学习策略，该策略融合了动态蒙皮场归一化与冲突感知梯度校正，从而能够有效平衡能量最小化、空间平滑性与正交性约束。PhysSkin在可泛化的神经蒙皮方面展现出卓越性能，并实现了实时的物理驱动动画。

Abstract

Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.

Keywords: physics-based animation, neural skinning fields, self-supervised learning, real-time animation, generalizable animation, transformer encoder, cross-attention decoder, mesh-free deformation
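
The abstract builds on Linear Blend Skinning, in which a point is deformed by a weighted blend of handle transformations, x' = Σ_i w_i(x) T_i x. The sketch below shows that formula in 2D. It is not PhysSkin itself: in the paper the skinning fields are learned by a neural network, whereas here a normalized Gaussian falloff stands in as a hypothetical continuous weight field, and all function names and the `sigma` parameter are assumptions.

```python
import math


def skinning_weights(point, handles, sigma=1.0):
    """Hypothetical continuous skinning field: normalized Gaussian falloff around each handle."""
    x, y = point
    raw = [math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2.0 * sigma ** 2)) for hx, hy in handles]
    total = sum(raw)
    # Normalizing makes the weights sum to 1 at every query point.
    return [r / total for r in raw]


def apply_affine(transform, point):
    """Apply a 2D affine transform given as ((a, b, tx), (c, d, ty))."""
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def linear_blend_skinning(point, handles, transforms, sigma=1.0):
    """Deform a point as the weight-blended result of all handle transforms."""
    weights = skinning_weights(point, handles, sigma)
    moved = [apply_affine(t, point) for t in transforms]
    return (
        sum(w * m[0] for w, m in zip(weights, moved)),
        sum(w * m[1] for w, m in zip(weights, moved)),
    )
```

Because the weights are evaluated at arbitrary coordinates instead of being stored per mesh vertex, the same field can be queried on any discretization of the shape, which is the mesh-free property the abstract emphasizes.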

190. ❌ Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Authors: Anupam Pani, Yanchao Yang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23190v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

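Scoring rationale: The paper studies vision language models (VLMs) for egocentric behavior understanding and integrates eye-gaze data to improve model performance. All scoring keywords specifically target the techniques, training methods, optimization, applications, or evaluation of large language models (LLMs), whereas this paper focuses on VLMs, multimodal models that combine vision and language and differ fundamentally from text-only LLMs. The paper does not involve LLM-specific techniques, training methods (pre-training, fine-tuning, alignment), optimization (quantization, inference acceleration), applications (agents, tool use), or evaluation (hallucination mitigation), nor any specific AI for Science subfield (bioinformatics, cheminformatics). Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a gaze-regularized framework that integrates eye-gaze data into vision language models, substantially improving the accuracy of egocentric behavior understanding and future event prediction; experiments show semantic scores nearly 13% higher than baseline models.

Abstract (Chinese translation)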

眼动注视（包含凝视点与扫视）为理解人类意图与未来行为提供了关键洞见。本研究提出一种注视正则化框架，旨在增强视觉语言模型（VLMs）在第一人称行为理解中的性能。与现有仅依赖视觉数据而忽略注视信息的方法不同，我们的方法在训练过程中直接将注视信息整合到VLM架构中。通过生成基于注视的查询，模型能够动态聚焦于注视高亮区域，同时注视正则化机制确保了模型注意力与人类注意力模式的对齐。为深入探究注视信息如何有效融入VLMs，我们进行了大量实验，探索了多种注视数据整合策略。这些创新使得模型能够生成带有详细动作描述的未来事件预测。实验结果表明，相较于未利用注视数据的基线模型，本方法在语义评分上实现了近13%的提升，凸显了其有效性。此项工作为在VLMs中利用人类注视信息奠定了基础，显著增强了其在需要精准、鲁棒未来事件预测应用中的预测能力。

Abstract

Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.

Keywords: Vision Language Models, Gaze Regularization, Egocentric Behavior Understanding, Future Event Prediction, Human Attention Alignment, Gaze-based Queries, Action Description

191. ❌ ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Authors: Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

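Scoring rationale: The paper focuses on Video Large Language Models (VideoLLMs), an application of large models to video understanding, and is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It proposes the ViKey framework, which strengthens temporal understanding through visual prompting and keyword-frame mapping. This is an application-level innovation for large models that does not involve the specific techniques behind the other keywords (MoE, quantization, inference acceleration, and so on), so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets the drop in temporal reasoning that VideoLLMs suffer under sparse frame sampling. It proposes ViKey, a framework that strengthens temporal understanding through visual prompting and keyword-frame mapping, and that on some datasets preserves dense-frame baseline performance while using only 20% of the frames.

Abstract (Chinese translation)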

近期视频大语言模型（VideoLLMs）的发展使其在多种多模态视频任务中展现出强大性能。为降低处理密集视频帧的高计算成本，以效率为导向的方法（如帧选择）已被广泛采用。尽管这些方法能有效减少冗余，但在需要时序推理的任务中常导致显著性能下降。与人类能够从稀疏视觉线索推断事件进程不同，VideoLLMs在省略中间帧时经常误解时序关系。为应对这一局限，我们探索将视觉提示（Visual Prompting, VP）作为增强VideoLLMs时序理解能力的轻量而有效的途径。分析表明，仅通过为每帧添加显式序数信息标注，即可帮助模型感知时序连续性。这种视觉线索还支持帧级引用，并缓解稀疏采样序列中的位置歧义。基于这些发现，我们提出ViKey——一种无需训练、结合VP与轻量级关键词-帧映射（Keyword-Frame Mapping, KFM）模块的框架。KFM利用帧索引作为类字典键，将文本线索关联至最相关的帧，在推理过程中提供显式时序锚点。尽管方法简洁，我们的方案显著提升了时序推理能力，并在部分数据集上仅使用20%的帧数即保持了密集帧基线的性能水平。

Abstract

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.

Keywords: Video Large Language Models, temporal reasoning, visual prompting, frame selection, Keyword-Frame Mapping, training-free framework, sparse sampling, multimodal video tasks
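
The abstract describes two ingredients: tagging each sampled frame with explicit ordinal information, and using frame indices as dictionary-like keys that link textual cues to frames. The sketch below illustrates only the bookkeeping behind that idea. It is not the ViKey implementation: the uniform sampling rule, the 1-based ordinal keys, the function names, and the precomputed relevance `scores` are all hypothetical, and the actual overlay of labels onto frame pixels is omitted.

```python
def sample_frames(num_frames, keep_ratio=0.2):
    """Uniformly pick roughly keep_ratio of the frame indices (always at least one)."""
    keep = max(1, round(num_frames * keep_ratio))
    step = num_frames / keep
    return [int(i * step) for i in range(keep)]


def ordinal_labels(frame_indices):
    """Map each ordinal key (1-based position in the sparse sequence) to its original frame index."""
    return {key: idx for key, idx in enumerate(frame_indices, start=1)}


def keyword_frame_mapping(keywords, scores):
    """Link each keyword to the ordinal key with the highest (hypothetical) relevance score.

    scores[keyword][key] is assumed to be supplied by some external relevance model.
    """
    return {kw: max(scores[kw], key=scores[kw].get) for kw in keywords}
```

The resulting keyword-to-key dictionary can be rendered into the text prompt, so that the model can refer to a specific labeled frame instead of inferring position from a sparse sequence.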

192. ❌ Gimbal360: Differentiable Auto-Leveling for Canonicalized $360^\circ$ Panoramic Image Completion

Authors: Yuqin Lu, Haofeng Liu, Yang Zhou, Jun Liang, Shengfeng He, Jing Li Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23179v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

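Scoring rationale: The paper focuses on 360-degree panoramic image completion in computer vision and uses diffusion models to resolve geometric and topological mismatches. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas the paper involves no language models, text processing, or LLM techniques. It studies visual generative models under specific geometric constraints and has no direct connection to the LLM topics covered by the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the Gimbal360 framework, which introduces a Canonical Viewing Space and a Differentiable Auto-Leveling module to address the geometric and topological challenges of generating structurally consistent 360-degree panoramas from unposed perspective images. It also introduces the Horizon360 dataset and reports state-of-the-art performance.

Abstract (Chinese translation)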

扩散模型在二维图像外延绘制方面表现卓越，但将其扩展至从无位姿透视图像完成$360^\circ$全景图则面临挑战，这源于透视投影与球形全景之间的几何与拓扑失配。我们提出Gimbal360，一个原则性框架，显式地桥接透视观测与球形全景。我们引入了一个规范化投影几何的规范观察空间，为两个域之间提供一致的中间表示。为了将真实场景输入锚定至此空间，我们提出了一种可微分自动调平模块，该模块能在推理时无需相机参数的情况下稳定特征方向。全景生成还引入了拓扑挑战：标准生成架构假设有界的欧几里得图像平面，而等距柱状投影全景具有固有的$S^1$周期性，欧几里得操作因此会破坏边界连续性。我们通过在隐空间中强制拓扑等变性来解决这一失配问题，以保持无缝的周期结构。为支持此框架，我们引入了Horizon360——一个精心构建的大规模重力对齐全景环境数据集。大量实验表明，显式标准化几何与拓扑先验使Gimbal360在结构一致的$360^\circ$场景补全任务中实现了最先进的性能。

Abstract

Diffusion models excel at 2D outpainting, but extending them to $360^\circ$ panoramic completion from unposed perspective images is challenging due to the geometric and topological mismatch between perspective projections and spherical panoramas. We present Gimbal360, a principled framework that explicitly bridges perspective observations and spherical panoramas. We introduce a Canonical Viewing Space that regularizes projective geometry and provides a consistent intermediate representation between the two domains. To anchor in-the-wild inputs to this space, we propose a Differentiable Auto-Leveling module that stabilizes feature orientation without requiring camera parameters at inference. Panoramic generation also introduces a topological challenge. Standard generative architectures assume a bounded Euclidean image plane, while Equirectangular Projection (ERP) panoramas exhibit intrinsic $S^1$ periodicity. Euclidean operations therefore break boundary continuity. We address this mismatch by enforcing topological equivariance in the latent space to preserve seamless periodic structure. To support this formulation, we introduce Horizon360, a curated large-scale dataset of gravity-aligned panoramic environments. Extensive experiments show that explicitly standardizing geometric and topological priors enables Gimbal360 to achieve state-of-the-art performance in structurally consistent $360^\circ$ scene completion.

Keywords: 360-degree panoramic completion, diffusion models, canonical viewing space, differentiable auto-leveling, topological equivariance, Equirectangular Projection, spherical panoramas, Horizon360 dataset
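
The abstract notes that equirectangular panoramas are periodic along the horizontal axis, so operations that assume a bounded image plane break continuity at the left and right borders. A common way to respect that periodicity is wrap-around (circular) padding, sketched below. Whether Gimbal360 uses this exact mechanism is not stated in the abstract, so the function names and the 3-tap neighborhood sum are illustrative assumptions.

```python
def circular_pad_width(image, pad):
    """Pad every row by wrapping columns around, so the left and right borders become neighbors."""
    if pad == 0:
        return [row[:] for row in image]
    if pad > len(image[0]):
        raise ValueError("pad must not exceed the image width")
    return [row[-pad:] + row + row[:pad] for row in image]


def roll(row, shift):
    """Rotate a row to the right by shift positions (a change of panorama yaw)."""
    shift %= len(row)
    return row[-shift:] + row[:-shift] if shift else row[:]


def circular_sum3(row):
    """3-tap horizontal neighborhood sum computed with wrap-around padding."""
    padded = row[-1:] + row + row[:1]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(row))]
```

With wrap-around padding the filter commutes with horizontal rotation: `circular_sum3(roll(row, k))` equals `roll(circular_sum3(row), k)` for any `k`, which is the kind of equivariance a zero-padded filter loses at the seam.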

193. ❌ GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field

Authors: Jingtao Zhou, Xuan Gao, Dongyu Liu, Junhui Hou, Yudong Guo, Juyong Zhang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

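Scoring rationale: GSwap is a computer vision and graphics paper that proposes a video head-swapping system built on a dynamic neural Gaussian field. Its core contributions (a 3D Gaussian feature field, SMPL-X surface embedding, and neural re-rendering) are applications of generative modeling and 3D reconstruction. It has no direct connection to the large-model and deep learning topics in the keyword list (MoE, Scaling Laws, RLHF, PEFT, and so on) or to scientific application areas such as AI for Science. The only point of contact is the abstract's statement that the authors "adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation", which relates to "Pre-training OR Continual Pre-training OR Domain Adaptation" and therefore earns a relevance of 5 (some relevance). No other keyword is touched on, so each of those has a relevance of 0.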
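The pass mark of 26.6 shown in the score line follows from the formula in the report's configuration section, 5.0 + 0.8 × total weight, with a total weight of 27.0. The sketch below reproduces that arithmetic and totals the table rows for this paper. The per-row formula (weight × relevance) is an assumption made for illustration, since this section does not state how the report derives each row's score; the function and variable names are hypothetical.

```python
from typing import List, Tuple

# Values taken from the report's configuration header.
BASE_PASS_MARK = 5.0
PASS_MARK_SLOPE = 0.8
TOTAL_KEYWORD_WEIGHT = 27.0


def pass_mark(total_weight: float = TOTAL_KEYWORD_WEIGHT) -> float:
    """Pass mark = 5.0 + 0.8 x total weight, as stated in the report header."""
    return BASE_PASS_MARK + PASS_MARK_SLOPE * total_weight


def total_score(rows: List[Tuple[float, float]]) -> float:
    """Sum per-keyword scores over (weight, relevance) rows.

    ASSUMPTION: each row's score is weight * relevance. The report does not
    show its per-keyword formula in this section.
    """
    return sum(weight * relevance for weight, relevance in rows)


# GSwap rows: 26 keywords at relevance 0.0 and one at 5.0, all with weight 0.0.
gswap_rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
score = total_score(gswap_rows)
passed = score >= pass_mark()
```

Under this assumption, a row with weight 0.0 contributes nothing regardless of its relevance, which is consistent with the 0.0 entries in the Score column above.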

!!! tip deepseek-chat TL;DR

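GSwap is a video head-swapping system based on a dynamic neural Gaussian field and SMPL-X surface embedding. It addresses the limitations of existing methods in 3D consistency, natural facial expression, and background blending, and achieves high-fidelity, temporally coherent head swapping.

Abstract (Chinese translation)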

本文提出GSwap——一种基于动态神经高斯人像先验的新型、具有一致性且逼真的视频头部替换系统，显著推进了人脸与头部替换的技术水平。与以往主要依赖二维生成模型或三维可变形人脸模型（3DMM）的方法不同，我们的方法克服了其固有局限，包括三维一致性差、面部表情不自然以及合成质量受限等问题。此外，现有技术因缺乏完整的头部建模和低效的背景融合，在处理完整头部替换任务时往往产生可见的伪影和对齐偏差。为解决这些挑战，GSwap引入了一种内嵌于全身SMPL-X表面的本征三维高斯特征场，将二维人像视频有效提升为动态神经高斯场。这一创新在保持自然头颈躯干关系和流畅运动动态的同时，确保了高保真、三维一致的人像渲染。为便于训练，我们仅使用少量参考图像将预训练的二维人像生成模型适配至源头部域，实现了高效的域适应。此外，我们提出一种神经重渲染策略，将合成前景与原始背景和谐融合，消除融合伪影并提升真实感。大量实验表明，GSwap在视觉质量、时序连贯性、身份保持和三维一致性等多个方面均超越现有方法。

Abstract

We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.

Keywords: head swapping, neural Gaussian field, 3D consistency, SMPL-X, domain adaptation, neural re-rendering, video portrait, facial expression

194. ❌ Dual Contrastive Network for Few-Shot Remote Sensing Image Scene Classification

Authors: Zhong Ji, Liyuan Hou, Xuan Wang, Gang Wang, Yanwei Pang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23161v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

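Scoring rationale: The paper studies few-shot scene classification of remote sensing images and proposes a contrastive-learning-based dual-branch network (DCN). Its core is few-shot learning and contrastive learning in computer vision, applied to remote sensing image analysis. Of the 27 keywords, only the last one, "AI for Science OR Bioinformatics OR Cheminformatics", has some relevance, because remote sensing image analysis can be viewed as an application of AI to science (earth and environmental science). The paper does not explicitly mention large models, innovations in deep learning principles, or bioinformatics/cheminformatics, so the relevance is only 5 (some relevance). The other 26 keywords all concern LLM-related techniques, methods, or applications (MoE, scaling laws, alignment, RAG, inference acceleration, agents, and so on). The paper involves no large-model content and uses no new deep learning principles, so their relevance is 0.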

!!! tip deepseek-chat TL;DR

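The paper targets the small inter-class variance and large intra-class variance that make few-shot remote sensing scene classification hard. It proposes a Dual Contrastive Network (DCN) whose context-guided and detail-guided contrastive learning branches improve feature discriminability and invariance, and it reports competitive performance on four public datasets.

Abstract (Chinese translation)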

小样本遥感图像场景分类（FS-RSISC）旨在仅利用少量标注样本对遥感图像进行分类。其主要挑战在于遥感图像固有的类间差异小、类内差异大的特性。为解决这些挑战，我们提出一种基于迁移的双重对比网络（DCN），该网络在训练过程中引入了两个辅助的监督对比学习分支。具体而言，一个是上下文引导对比学习（CCL）分支，另一个是细节引导对比学习（DCL）分支，分别侧重于提升类间区分度与类内不变性。在CCL分支中，我们首先设计了一个Condenser Network以捕获上下文特征，随后在获得的上下文特征基础上进行监督对比学习，以促使模型学习更具判别性的特征。在DCL分支中，我们设计了Smelter Network以突出重要的局部细节信息，并基于细节特征图构建监督对比学习，充分挖掘每张特征图中的空间信息，使模型能够聚焦于不变的细节特征。在四个公开的遥感基准数据集上进行的大量实验表明，我们提出的DCN具有优异的性能。

Abstract

Few-shot remote sensing image scene classification (FS-RSISC) aims at classifying remote sensing images with only a few labeled samples. The main challenges lie in small inter-class variances and large intra-class variances, which are the inherent property of remote sensing images. To address these challenges, we propose a transfer-based Dual Contrastive Network (DCN), which incorporates two auxiliary supervised contrastive learning branches during the training process. Specifically, one is a Context-guided Contrastive Learning (CCL) branch and the other is a Detail-guided Contrastive Learning (DCL) branch, which focus on inter-class discriminability and intra-class invariance, respectively. In the CCL branch, we first devise a Condenser Network to capture context features, and then leverage a supervised contrastive learning on top of the obtained context features to facilitate the model to learn more discriminative features. In the DCL branch, a Smelter Network is designed to highlight the significant local detail information. And then we construct a supervised contrastive learning based on the detail feature maps to fully exploit the spatial information in each map, enabling the model to concentrate on invariant detail features. Extensive experiments on four public benchmark remote sensing datasets demonstrate the competitive performance of our proposed DCN.
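Both auxiliary branches described above rely on supervised contrastive learning. As a rough illustration of that objective, here is a minimal sketch of the generic supervised contrastive loss over one batch of embeddings. It is not the authors' CCL or DCL branch: the Condenser and Smelter networks are not modeled, and the function name, temperature value, and toy data are assumptions.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(features: Sequence[Sequence[float]],
                labels: Sequence[int],
                temperature: float = 0.1) -> float:
    """Generic supervised contrastive loss over one batch (illustrative only)."""
    z = [_normalize(f) for f in features]
    n = len(z)
    losses = []
    for i in range(n):
        # Similarity of anchor i to every other sample, scaled by the temperature.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class sample contributes nothing
        # Log-sum-exp over all non-anchor samples, shifted for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two classes, 2-D embeddings (made-up values).
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
loss_separated = supcon_loss(separated, toy_labels)
loss_mixed = supcon_loss(mixed, toy_labels)
```

The loss is lower when same-class embeddings cluster together (`separated`) than when they are interleaved (`mixed`), which is the behavior a discriminability-oriented branch would be trained toward.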

Keywords: few-shot learning, remote sensing image classification, contrastive learning, scene classification, dual contrastive network, inter-class variance, intra-class variance, supervised contrastive learning

Authors: Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Tobias Glück, Anke Schmeink, Andreas Kugi Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

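Scoring rationale: The paper studies the use of vision-language models (VLMs) in active learning. It is highly relevant to "Foundation Models" (8 points) because it explicitly uses a pretrained VLM as the teacher model. It has some relevance to "Pre-training" (5 points) because it exploits pretrained VLM representations. Other keywords such as MoE, SLMs, SFT, and RAG are not covered in the paper and score 0. The paper does not address applications in specific scientific fields such as bioinformatics, so "AI for Science" also scores 0.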

!!! tip deepseek-chat TL;DR

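The paper proposes CCMA, an active learning framework that uses a pretrained vision-language model to provide semantically grounded uncertainty estimates for sample selection, achieving better data efficiency than existing methods across multiple benchmarks.

Abstract (Chinese translation)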

视觉基础模型通过强大的预训练表征和卓越的零样本能力革新了视觉识别领域，但其在数据高效学习方面的潜力仍很大程度上未被开发。主动学习旨在通过策略性地选择信息量最大的样本进行标注以最小化标注成本，但现有方法大多忽视了现代视觉-语言模型中嵌入的丰富多模态知识。我们提出Conformal Cross-Modal Acquisition，一种新颖的主动学习框架，通过师生架构桥接视觉与语言模态。CCMA采用预训练的视觉-语言模型作为教师，提供基于语义的不确定性估计，并经过保形校准以指导纯视觉学生模型的样本选择。通过将多模态保形评分与多样性感知选择策略相结合，CCMA在多个基准测试中实现了卓越的数据效率。我们的方法持续优于最先进的主动学习基线，相较于仅依赖不确定性或多样性度量的方法展现出明显优势。

Abstract

Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
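To make "conformally calibrated" uncertainty more concrete, the sketch below shows generic split conformal prediction and uses prediction-set size as an acquisition signal. It is an illustration under stated assumptions, not CCMA itself: the abstract does not give the conformal score, the diversity-aware selection step, or any function names, so everything here (the `1 - p` score, ranking by set size, the toy probabilities) is hypothetical.

```python
import math
from typing import List, Sequence


def conformal_threshold(cal_probs: Sequence[Sequence[float]],
                        cal_labels: Sequence[int],
                        alpha: float = 0.1) -> float:
    """Split-conformal quantile of the nonconformity scores 1 - p(true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    return scores[k - 1] if k <= n else 1.0  # 1.0 admits every class


def prediction_set(probs: Sequence[float], threshold: float) -> List[int]:
    """Classes whose nonconformity score 1 - p is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


def select_for_labeling(pool_probs: Sequence[Sequence[float]],
                        threshold: float, budget: int) -> List[int]:
    """Pick the pool items with the largest prediction sets (most ambiguous)."""
    sizes = [len(prediction_set(p, threshold)) for p in pool_probs]
    ranked = sorted(range(len(pool_probs)), key=lambda i: -sizes[i])
    return ranked[:budget]


# Made-up teacher probabilities on a labeled calibration split.
cal_probs = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.3, 0.4, 0.3]]
cal_labels = [0, 1, 2, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

# Made-up teacher probabilities on the unlabeled pool.
pool = [[0.95, 0.03, 0.02], [0.4, 0.35, 0.25], [0.34, 0.33, 0.33]]
chosen = select_for_labeling(pool, threshold, budget=2)
```

In this toy run the confident first pool item gets a one-class prediction set and is skipped, while the two ambiguous items are selected for labeling.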

Keywords: Active Learning, Vision-Language Models, Conformal Prediction, Data Efficiency, Multimodal Learning, Sample Selection, Teacher-Student Architecture, Uncertainty Estimation

196. ❌ InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Authors: Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

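Scoring rationale: The paper proposes the InterDyad framework, which uses a Multimodal Large Language Model (MLLM) to distill linguistic intent from audio and thereby control the timing and appropriateness of reactions, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). The other keywords mainly concern large-model principles, training methods, inference optimization, alignment, agent systems, compression and acceleration, and scientific applications. The paper does not address these specific techniques, so they score 0.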

!!! tip deepseek-chat TL;DR

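The paper addresses the difficulty existing speech-to-video synthesis methods have in capturing cross-individual dependencies and offering fine-grained control over reactive behaviors in dyadic settings. It proposes InterDyad, which queries intermediate visual guidance and leverages a Multimodal Large Language Model to significantly improve the naturalness and contextual grounding of two-person interactions.

Abstract (Chinese translation)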

尽管语音到视频合成技术已取得进展，但现有方法往往难以捕捉跨个体依赖关系，并在二元交互场景中实现对反应行为的细粒度控制。为应对这些挑战，我们提出InterDyad框架，该框架通过查询结构化运动指导来实现自然交互动态的合成。具体而言，我们首先设计了一个交互性注入器，该模块基于从参考视频中提取的身份无关运动先验实现视频重演。在此基础上，我们引入基于元查询的模态对齐机制，以弥合对话音频与这些运动先验之间的鸿沟。通过利用多模态大语言模型，我们的框架能够从音频中提炼语言意图，从而精确控制反应行为的时机与恰当性。为在极端头部姿态下进一步提升唇形同步质量，我们提出角色感知二元高斯引导机制，以增强唇部同步与空间一致性。最后，我们构建了专用评估体系，采用全新设计的指标来量化二元交互质量。综合实验表明，InterDyad在生成自然且符合上下文语境的双人交互方面显著优于现有先进方法。演示视频请参见项目页面：https://interdyad.github.io/。

Abstract

Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with novelly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.

Keywords: speech-to-video synthesis, dyadic interaction, Multimodal Large Language Model, interactive dynamics, motion guidance, lip-synchronization, video reenactment, modality alignment

197. ❌ 3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio

Authors: Jihwan Hong, Jaeyoung Do Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

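Scoring rationale: The paper studies audio-based referring video object segmentation (ARVOS). It converts audio to text with ASR and segments with a pretrained RVOS model that uses a vision-language architecture, which has no direct connection to the large-model and deep learning keywords provided. Those keywords all concern specific techniques in large-model architecture, training, inference, alignment, and applications, none of which the paper addresses, so every keyword has a relevance of 0.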

!!! tip deepseek-chat TL;DR

The report presents VIRST-Audio, a framework that tackles audio-based referring video object segmentation by converting audio queries to text and building on a pretrained RVOS model with a vision-language architecture. It placed 3rd in the MeViS-Audio track of the 5th PVUW Challenge.

Abstract (Chinese translation)

基于音频的指代视频目标分割（Audio-based Referring Video Object Segmentation, ARVOS）要求将音频查询随时间定位到像素级的目标掩码，这给声学信号与时空视觉表征的关联带来了挑战。本报告提出VIRST-Audio，这是一个基于预训练RVOS模型并结合视觉-语言架构构建的实用框架。我们无需依赖音频特定训练，而是通过自动语音识别（ASR）模块将输入音频转换为文本，并利用文本监督进行分割，从而实现了从基于文本的推理到音频驱动场景的有效迁移。为提升鲁棒性，我们进一步引入了一种存在感知门控机制，该机制能够评估被指代的目标对象是否存在于视频中，并在其缺失时抑制预测，从而减少幻觉掩码并稳定分割行为。我们在第五届PVUW挑战赛的MeViS-Audio赛道上评估了所提方法，VIRST-Audio取得了第三名的成绩，展现了在基于音频的指代视频分割任务中强大的泛化能力和可靠的性能。

Abstract

Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
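The existence-aware gating idea, suppressing predictions when the referred object is judged absent, can be sketched in a few lines. This is a minimal illustration only: the abstract does not say how the existence estimate is produced or thresholded, so the single video-level probability, the 0.5 cut-off, and the names `gate_masks` and `existence_prob` are assumptions.

```python
from typing import List

Mask = List[List[int]]  # binary segmentation mask for one frame


def gate_masks(masks: List[Mask], existence_prob: float,
               threshold: float = 0.5) -> List[Mask]:
    """Pass masks through when the target is judged present, else return empty masks."""
    if existence_prob >= threshold:
        return masks
    # Target judged absent: replace every frame's mask with zeros of the same shape.
    return [[[0 for _ in row] for row in mask] for mask in masks]


# Two tiny 2x2 frames with made-up predictions.
frames = [[[0, 1], [1, 1]], [[0, 0], [1, 1]]]
kept = gate_masks(frames, existence_prob=0.9)
suppressed = gate_masks(frames, existence_prob=0.2)
```

Gating at the video level trades a small risk of dropping a true target for fewer hallucinated masks; a per-frame gate would be an equally plausible reading of the abstract.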

Keywords: Audio-based Referring Video Object Segmentation, ARVOS, ASR, vision-language architecture, pretrained RVOS model, existence-aware gating, hallucinated masks, MeViS-Audio

198. ❌ VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution

Authors: August Leander Høeg, Sophia Wiinberg Bardenfleth, Hans Martin Kjer, Tim Bjørn Dyrby, Vedrana Andersen Dahl, Anders Bjorholm Dahl Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23153v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

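Scoring rationale: The paper focuses on volumetric super-resolution in medical and scientific imaging, a computer vision and medical image processing topic. Its core contributions are a new dataset of paired high- and low-resolution 3D scans (VoDaSuRe) and the finding of a significant domain shift between models trained on downsampled data and models trained on real low-resolution scans. The scoring keywords all relate directly to large language models (LLMs), deep learning principles (MoE, Scaling Laws, training methods, inference optimization, agents, and so on), or applications of large models in various fields. This paper instead studies classic CNN- and Transformer-based super-resolution models applied to medical imaging, a traditional computer vision task, and involves neither large language models nor any of the large-model techniques in the keyword list. Accordingly, "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 (some relevance) because the paper is an instance of AI for Science (specifically medical imaging), and every other keyword receives 0 (unrelated).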

!!! tip deepseek-chat TL;DR

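By introducing the VoDaSuRe dataset, the paper shows that the performance of current deep-learning-based volumetric super-resolution methods is overstated when they are trained on downsampled data: applied to real low-resolution scans, these models do not recover the lost structures and instead predict a smoothed average.

Abstract (Chinese translation)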

近期体数据超分辨率技术在医学与科学成像领域取得显著进展，基于Transformer和CNN的方法即使在极高缩放因子下也能获得令人印象深刻的结果。本研究发现，此类性能很大程度上源于对降采样数据的训练，而非真实低分辨率扫描数据。这种对降采样的依赖部分源于成对高-低分辨率三维数据集的稀缺性。为解决此问题，我们提出了VoDaSuRe——一个包含成对高、低分辨率扫描的大规模体数据集。在VoDaSuRe上训练模型时，我们揭示了显著差异：基于降采样数据训练的超分辨率模型产生的预测结果比基于真实低分辨率扫描训练的模型更锐利，而后者会平滑细微结构。反之，将基于降采样数据训练的模型应用于真实扫描时，虽能保留更多结构但准确性不足。我们的研究结果表明，当前超分辨率方法的性能被高估——当应用于真实数据时，它们无法恢复低分辨率扫描中丢失的结构，而是预测出经过平滑的平均结果。我们认为，基于深度学习的体数据超分辨率技术的进步需要具备高复杂度成对真实扫描的数据集，例如VoDaSuRe。我们的数据集与代码已通过以下网址公开：https://augusthoeg.github.io/VoDaSuRe/

Abstract

Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: https://augusthoeg.github.io/VoDaSuRe/

Keywords: volumetric super-resolution, domain shift, paired dataset, medical imaging, 3D scans, deep learning, VoDaSuRe, downsampling

199. ❌ PiCo: Active Manifold Canonicalization for Robust Robotic Visual Anomaly Detection

Authors: Teng Yan, Binkai Liu, Shuai Liu, Yue Yu, Bingzhuo Zhong Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23122v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

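Scoring rationale: The paper focuses on active canonicalization for robotic visual anomaly detection (VAD) and proposes the PiCo framework, which improves robustness through physical reorientation and neural latent canonicalization. The scoring keywords all relate directly to large models, deep learning principles, or scientific AI applications, whereas this paper addresses a specific engineering problem in computer vision and robotic perception. It does not cover large models, language models, training techniques, reasoning methods, agent systems, or scientific AI applications, so every keyword has a relevance of 0.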

!!! tip deepseek-chat TL;DR

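The paper proposes PiCo, an active manifold canonicalization framework that uses physical reorientation and neural latent canonicalization to make robotic visual anomaly detection robust to diverse poses and unstable operating conditions. On the M2AD benchmark it reaches 93.7% O-AUROC and 98.5% accuracy in closed-loop scenarios.

Abstract (Chinese translation)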

机器人视觉异常检测（VAD）的工业部署，根本上受限于其在多种六自由度姿态配置及不稳定操作条件（如光照变化与阴影）下的被动感知能力，其中内在语义异常与物理干扰共存并相互影响。为克服这些局限，本文提出从被动特征学习向主动规范化（Active Canonicalization）的范式转变。我们引入PiCo（姿态条件规范化）作为一个统一框架，主动将观测投影至条件不变的规范流形上。PiCo通过级联机制运行：第一阶段为主动物理规范化，使机器人能够重新定向物体，从而从源头降低几何不确定性；第二阶段为神经潜在规范化，采用包含输入层光度处理、特征层潜在精炼与语义层上下文推理的三阶段去噪层次结构，逐步消除跨表征尺度的干扰因素。在大规模M2AD基准上的广泛评估验证了该范式的优越性。PiCo实现了93.7%的O-AUROC最优性能，在静态设置中较现有方法提升3.7%，并在主动闭环场景中达到98.5%的准确率。这些结果表明，主动流形规范化对于实现鲁棒的具身感知至关重要。

Abstract

Industrial deployment of robotic visual anomaly detection (VAD) is fundamentally constrained by passive perception under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, where intrinsic semantic anomalies and physical disturbances coexist and interact. To overcome these limitations, a paradigm shift from passive feature learning to Active Canonicalization is proposed. PiCo (Pose-in-Condition Canonicalization) is introduced as a unified framework that actively projects observations onto a condition-invariant canonical manifold. PiCo operates through a cascaded mechanism. The first stage, Active Physical Canonicalization, enables a robotic agent to reorient objects in order to reduce geometric uncertainty at its source. The second stage, Neural Latent Canonicalization, adopts a three-stage denoising hierarchy consisting of photometric processing at the input level, latent refinement at the feature level, and contextual reasoning at the semantic level, progressively eliminating nuisance factors across representational scales. Extensive evaluations on the large-scale M2AD benchmark demonstrate the superiority of this paradigm. PiCo achieves a state-of-the-art 93.7% O-AUROC, representing a 3.7% improvement over prior methods in static settings, and attains 98.5% accuracy in active closed-loop scenarios. These results demonstrate that active manifold canonicalization is critical for robust embodied perception.
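The headline metric above is an AUROC variant. For readers unfamiliar with it, the sketch below computes plain AUROC as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The abstract does not define the "O-" prefix or the M2AD evaluation protocol, so this is the generic metric on made-up scores, not the benchmark's exact procedure.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random anomalous sample outscores a random normal one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalous samples
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal samples
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up anomaly scores: one anomalous sample (0.35) ranks below a normal one (0.6).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of samples; it is used here for clarity rather than speed.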

Keywords: robotic visual anomaly detection, active canonicalization, manifold canonicalization, pose-in-condition canonicalization, physical canonicalization, neural latent canonicalization, embodied perception, M2AD benchmark

200. ❌ Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach

Authors: Miquel Lopez Escoriza, Pau Amargant Alvarez Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23116v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

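Scoring rationale: The paper studies the zero-shot application of SAM2 (Segment Anything Model 2) to segmenting 3D medical CT scans. This falls under AI for Science (biomedical image analysis), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper describes SAM2 as a foundation model, so it is also relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). The remaining keywords mainly concern technical details of large language models (MoE, RLHF, quantization, and so on) or specific applications (agents, reasoning methods) and are unrelated to the paper's computer vision and medical image segmentation topic, so they all score 0.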

!!! tip deepseek-chat TL;DR

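The paper studies how to apply Segment Anything Model 2 (SAM2) zero-shot to automatic segmentation of 3D CT scans. By restructuring the inference pipeline to handle volumetric data, it shows that effective segmentation is feasible without fine-tuning.

Abstract (Chinese translation)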

用于图像分割的基础模型在自然图像中展现出强大的泛化能力，但其在三维医学影像中的适用性仍有限。本研究探讨了无需任何微调或领域特定训练、将Segment Anything Model 2（SAM2）零样本应用于体计算机断层扫描（CT）数据自动分割的方法。我们分析了如何将SAM2应用于CT体数据，并指出其主要局限：缺乏固有的三维空间感知能力。为解决这一问题，我们提出了一系列仅需推理阶段调整的架构与流程改进方案，通过将CT切片视为有序序列，使SAM2基于视频的记忆机制适配三维数据。我们在TotalSegmentator数据集的500例CT扫描子集上进行了系统消融实验，以评估提示策略、记忆传播方案和多轮优化方法。基于这些发现，我们选取了性能最佳的配置方案，并在包含2500例CT扫描的TotalSegmentator数据集更大样本上报告了最终结果。研究表明，即使权重完全冻结，通过精心构建推理流程，SAM2仍能生成连贯的三维分割结果，这证明了全零样本方法在体医学图像分割中的可行性。

Abstract

Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-alone architectural and procedural modifications that adapt SAM2’s video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a bigger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.
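Treating CT slices as an ordered sequence means deciding in which order a video-style memory sees them. The sketch below shows one plausible scheme: start from a prompted seed slice and propagate toward both ends of the volume. The abstract says memory propagation schemes were evaluated but does not name them, so this bidirectional ordering is an assumption, and no SAM2 API is used; the helper only produces slice indices.

```python
from typing import List, Tuple


def bidirectional_slice_order(num_slices: int,
                              seed_index: int) -> Tuple[List[int], List[int]]:
    """Split a volume's slice indices into two ordered passes starting at the seed slice."""
    if not 0 <= seed_index < num_slices:
        raise ValueError("seed_index must be a valid slice index")
    forward = list(range(seed_index, num_slices))  # seed -> last slice
    backward = list(range(seed_index, -1, -1))     # seed -> first slice
    return forward, backward


# A 6-slice toy volume prompted on slice 2.
fwd, bwd = bidirectional_slice_order(num_slices=6, seed_index=2)
```

Each returned list could then be fed as a "video" to a frame-by-frame tracker, with the seed slice carrying the prompt in both passes.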

Keywords: SAM2, 3D CT segmentation, zero-shot approach, medical imaging, volumetric data, foundation models, TotalSegmentator dataset, inference pipeline

201. ❌ SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions

Authors: Jinzhe Tu, Ruilei Guo, Zihan Guo, Junxiao Yang, Shiyao Cui, Minlie Huang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

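Scoring rationale: The paper focuses on a visual-perception weakness of multimodal large language models (MLLMs), specifically their vulnerability to hidden-pattern visual illusions, and proposes a plug-and-play Strategy of Multi-Scale Perception (SMSP) to mitigate it. Its core concern is the perceptual alignment of MLLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since MLLMs extend LLMs. The other keywords mainly cover technical details of text-only LLMs, training methods, inference optimization, agent systems, and model compression, which have no direct connection to the paper's visual-perception topic, so each scores 0.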

!!! tip deepseek-chat TL;DR

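The paper studies why multimodal large language models (MLLMs) fail to perceive hidden-pattern visual illusions, traces the failure to a high-frequency attention bias, and proposes a plug-and-play Strategy of Multi-Scale Perception (SMSP). By suppressing distracting high-frequency backgrounds, SMSP significantly improves the performance of multiple MLLMs on illusion images.

Abstract (Chinese translation)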

近期研究表明，多模态大语言模型（MLLMs）对隐藏式视觉错觉高度脆弱，其中隐藏内容对人类显而易见，却难以被模型感知。这一缺陷揭示了当前多模态大语言模型与人类之间的感知错位，并引发了潜在的安全隐患。为系统探究此问题，我们构建了IlluChar——一个全面且具有挑战性的错觉数据集，并揭示了模型失效的关键内在机制：高频注意力偏差，即模型易受错觉图像中高频背景纹理干扰，从而忽略隐藏模式。为解决该问题，我们提出多尺度感知策略（SMSP），一种符合人类视觉感知策略的即插即用框架。通过抑制干扰性的高频背景，SMSP能生成更贴近人类感知的图像。实验表明，SMSP显著提升了所有受测多模态大语言模型在错觉图像上的性能，例如将Qwen3-VL-8B-Instruct的准确率从13.0%提升至84.0%。本研究为理解多模态大语言模型的视觉感知提供了新视角，并为增强其感知能力提供了实用且鲁棒的解决方案。代码已公开于https://github.com/Tujz2023/SMSP。

Abstract

Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models’ failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs’ visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.
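To make the idea of suppressing a high-frequency background concrete, here is a minimal sketch. It is not the SMSP implementation (the authors' code is at the repository linked in the abstract); it only assumes that a simple mean filter stands in for whatever low-pass step the method uses. The helper names `box_blur` and `make_illusion`, the 8x8 image size, and the intensity values are all hypothetical.

```python
def box_blur(image, radius=1):
    """Mean filter over a (2*radius+1)^2 window, clipped at the image borders."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


def make_illusion(size=8, amplitude=0.3):
    """Toy image: a low-frequency hidden pattern plus a checkerboard texture."""
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            hidden = 0.3 if x < size // 2 else 0.7  # hidden pattern: dark left, bright right
            texture = amplitude if (x + y) % 2 else -amplitude  # high-frequency distractor
            row.append(hidden + texture)
        image.append(row)
    return image


image = make_illusion()
smoothed = box_blur(image, radius=1)
# Compare one left-half pixel before and after smoothing.
print(round(image[4][1], 3), round(smoothed[4][1], 3))
```

In this toy case the raw pixel is dominated by the texture, while the smoothed pixel sits close to the hidden pattern's value, which is the effect the abstract attributes to removing high-frequency distractors.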

Keywords: Multimodal Large Language Models, MLLMs, visual illusions, perceptual misalignment, high-frequency attention bias, multi-scale perception, SMSP, human perception

202. ❌ A Synchronized Audio-Visual Multi-View Capture System

Authors: Xiangwei Shi, Era Dorta Perez, Ruud de Jong, Ojas Shirekar, Chirag Raman Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23089v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

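Scoring rationale: The paper describes a synchronized audio-visual multi-view capture system. It belongs to hardware systems, signal acquisition, and data collection, and is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, or AI applications). The paper does not mention any AI model, algorithm, training method, or AI-for-science application.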

!!! tip deepseek-chat TL;DR

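The paper addresses the lack of rigorous audio-video synchronization in existing multi-view capture systems. It presents a synchronized audio-visual multi-view capture system and quantifies its synchronization performance to show that it supports fine-grained analysis of conversational behavior.

Abstract (Chinese translation)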

多视角采集系统一直是受控条件下记录人体运动研究的重要工具。现有系统大多围绕视频流设计，对音频采集与严格的音视频同步支持不足甚至缺失，而这两者对研究对话互动至关重要——其中话轮转换、话语重叠及韵律特征等层面的时间精度具有关键意义。本技术报告描述了一种视听多视角采集系统，该系统通过将同步音频与同步视频视作一等信号来弥补这一缺陷。该系统在多相机采集流程中整合了多通道麦克风录音，并采用统一的时序架构，提供了一套涵盖标定、采集与质量控制的实用工作流程，支持大规模可重复录制。我们量化了实际部署中的同步性能，结果表明所生成的录制素材在时间维度上具有足够的一致性，能够支持对话行为的细粒度分析与数据驱动建模。

Abstract

Multi-view capture systems have been an important tool in research for recording human motion under controlled conditions. Most existing systems are designed around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversation behavior.

Keywords: audio-visual, multi-view capture system, synchronization, conversational interaction, temporal consistency, calibration, data-driven modeling, quality control

203. ❌ AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection

Authors: Yangxin Yu, Yue Zhou, Bin Li, Kaiqing Lin, Haodong Li, Jiangqun Ni, Bo Cao Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

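Scoring rationale: The paper proposes the AgentFoX framework, whose core is an LLM-driven agent for AI-generated image detection, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). The framework emphasizes explainability, which makes it highly relevant to 'Mechanistic Interpretability' (10 points). The agent uses multi-phase analysis and structured reasoning, giving partial relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each). The agent integrates expert evidence, which can be viewed as a form of tool use, giving partial relevance to 'Tool Use' (5 points). Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications, are not covered or are only marginally mentioned, so each scores 0.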
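The arithmetic behind the table can be sketched as follows. This is only an illustration under a stated assumption: that each row's score is weight × relevance and that the total is compared with the 26.6 threshold shown in the 0.0 / 26.6 score line. The report does not state the per-row formula, and the function names are hypothetical. Under that assumption, a weight of 0.0 on every row yields a total of 0.0 regardless of relevance.

```python
def total_score(rows):
    """Sum per-keyword scores, ASSUMING score = weight * relevance for each row."""
    return sum(weight * relevance for weight, relevance in rows)


def passes(rows, threshold=26.6):
    """Compare the total with the pass threshold shown next to the score."""
    return total_score(rows) >= threshold


# (weight, relevance) pairs for the AgentFoX rows with non-zero relevance.
# Weights are 0.0 exactly as listed in the table above.
agentfox_rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0)]

# Hypothetical comparison: the same relevance values with a weight of 1.0 per keyword.
reweighted_rows = [(1.0, relevance) for _, relevance in agentfox_rows]

print(total_score(agentfox_rows), passes(agentfox_rows))
print(total_score(reweighted_rows), passes(reweighted_rows))
```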

!!! tip deepseek-chat TL;DR

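The paper targets AI-generated image detection, where existing methods rely on specific forgery artifacts, perform well only in narrow settings, and can produce conflicting judgments. It proposes AgentFoX, an LLM-agent-driven framework that integrates expert evidence through dynamic multi-phase analysis and produces a detailed, explainable forensic report, improving the reliability and trustworthiness of detection.

Abstract (Chinese translation)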

随着人工智能生成图像（AIGI）的真实感日益增强，迫切需要能够可靠区分合成内容与真实图像的取证工具。现有检测器通常针对特定伪造痕迹（如频域模式或语义不一致性）进行定制，导致其性能专门化，有时甚至产生相互矛盾的判断。为应对这些局限性，我们提出了AgentFoX——一个由大语言模型驱动的框架，将AIGI检测重新定义为动态的多阶段分析过程。该方法采用快速集成融合机制，其运作由经过校准的专家画像（Expert Profiles）与上下文聚类画像（Clustering Profiles）构成的规范化知识库引导。在推理过程中，智能体首先进行高层语义评估，随后过渡到细粒度、上下文感知的信号级专家证据综合，通过结构化推理解决矛盾。AgentFoX不返回粗糙的二元输出，而是生成一份详细、人类可读的取证报告，为其判定提供依据，从而增强实际部署中的可解释性与可信度。除了提供新颖的检测方案外，本研究还引入了一种可扩展的智能体范式，为未来持续演进的取证工具实现智能化集成提供了可能。

Abstract

The increasing realism of AI-Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts, such as frequency-domain patterns or semantic inconsistencies, leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present AgentFoX, a Large Language Model-driven framework that redefines AIGI detection as a dynamic, multi-phase analytical process. Our approach employs a quick-integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high-level semantic assessment, then transitions to fine-grained, context-aware synthesis of signal-level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human-readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real-world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.

Keywords: AI-Generated Image Detection, LLM Agent, Forensic Framework, Explainability, Multi-phase Analysis, Knowledge Base Integration, Structured Reasoning, Human-readable Report

204. ❌ NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization

Authors: Yik San Cheng, Runkai Zhao, Weidong Cai Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23104v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

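Scoring rationale: The paper focuses on transferring the 2D visual foundation model DINOv3 to 3D neuron segmentation. It falls under AI for Science (bioinformatics) and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper involves transferring a pre-trained model to a new domain, giving partial relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), although this is not its core. The other keywords mainly concern large language models (LLMs), reasoning, alignment, and optimization, and are unrelated to the paper's computer vision and biomedical image processing topic, so each scores 0.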

!!! tip deepseek-chat TL;DR

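The study addresses the lack of a high-quality foundation model for 3D neuron segmentation. By transferring the 2D self-supervised vision model DINOv3 to 3D biomedical images, it achieves more data-efficient and morphologically faithful neuron reconstruction and outperforms existing methods on multiple datasets.

Abstract (Chinese translation)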

二维视觉基础模型（例如DINOv3）是一种在大规模自然图像上训练的自监督模型，已展现出强大的零样本泛化能力，能够同时捕捉丰富的全局上下文与细粒度结构特征。然而，针对下游体数据神经影像的类似三维基础模型仍然缺乏，这主要源于三维图像采集的挑战以及高质量标注数据的稀缺。为填补这一空白，我们提出将DINOv3学习到的二维视觉表征适配至三维生物医学分割模型，从而实现更高数据效率且形态保真的神经元重建。具体而言，我们设计了一种基于膨胀的适配策略，将二维滤波器扩展为三维操作符，在保留DINOv3语义先验的同时适应三维神经元体数据块。此外，我们引入了一种拓扑感知的骨架损失函数，以显式增强基于图的神经元树突重建的结构保真度。在四个神经元影像数据集（包括两个来自BigNeuron的数据集，以及两个公开数据集NeuroFly和CWMBS）上的大量实验表明，本方法在重建精度上相较于当前最优方法（SoTA）取得了一致性提升，其中整体结构平均值（Entire Structure Average）平均提升2.9%，差异结构平均值（Different Structure Average）平均提升2.8%，差异结构百分比（Percentage of Different Structure）平均提升3.8%。代码：https://github.com/yy0007/NeurINO。

Abstract

2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing both rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure. Code: https://github.com/yy0007/NeurINO.
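The abstract does not give the exact inflation scheme. One common way to inflate a 2D filter into a 3D one (an assumption here, not necessarily the authors' choice) is to repeat the 2D weights along a new depth axis and divide by the depth, so that a volume whose slices are all identical produces the same response as the original 2D filter. The sketch below illustrates that property with plain nested lists; all function names and the example kernel are hypothetical.

```python
def inflate_kernel_2d_to_3d(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along a new axis, scaling weights by 1/depth."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(kernel_2d, patch_2d):
    """Dot product between a 2D kernel and a same-sized 2D patch."""
    return sum(
        k * v
        for k_row, p_row in zip(kernel_2d, patch_2d)
        for k, v in zip(k_row, p_row)
    )


def response_3d(kernel_3d, patch_3d):
    """Dot product between a 3D kernel and a same-sized 3D patch, slice by slice."""
    return sum(response_2d(k_slice, p_slice) for k_slice, p_slice in zip(kernel_3d, patch_3d))


edge_kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
slice_2d = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

kernel_3d = inflate_kernel_2d_to_3d(edge_kernel, depth=3)
volume = [slice_2d] * 3  # a toy volume made of three identical slices

print(response_2d(edge_kernel, slice_2d), response_3d(kernel_3d, volume))
```

Preserving the 2D response on depth-constant inputs is what lets the inflated filter start from the pretrained 2D behavior before it adapts to real volumetric structure.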

Keywords: DINOv3, 3D neuron segmentation, self-supervised learning, biomedical imaging, transfer learning, foundation models, neuroimaging, topology-aware loss

205. ❌ PolarAPP: Beyond Polarization Demosaicking for Polarimetric Applications

Authors: Yidong Luo, Chenggong Li, Yunfeng Song, Ping Wang, Boxin Shi, Junchao Zhang, Xin Yuan Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23071v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

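Scoring rationale: PolarAPP focuses on jointly optimizing demosaicking and downstream tasks in polarimetric imaging, which belongs to computer vision and image processing. All scoring keywords concern large models, deep learning techniques, or AI applications in science, whereas this paper studies a specific image reconstruction problem in polarimetric imaging and does not involve large-model techniques, novel deep learning methods, or AI applications in bioinformatics or cheminformatics. All keyword relevance scores are therefore 0.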

!!! tip deepseek-chat TL;DR

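The paper proposes PolarAPP, the first framework to jointly optimize polarimetric demosaicking and downstream tasks through meta-learning and an equivalent imaging constraint, significantly improving reconstruction quality and downstream performance.

Abstract (Chinese translation)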

偏振成像通过捕捉独特的表面-材料相互作用，实现了法线估计与反光消除等高级视觉应用。然而，现有应用（亦称为下游任务）依赖于对焦平面分割传感器原始测量值进行简单重组而构建的数据集，其中相同偏振角度的像素被提取并对齐为稀疏图像，而未经过恰当的去马赛克处理。这种重建策略产生了次优且不完整的目标数据，限制了下游任务的性能。此外，当前的去马赛克方法普遍与具体任务无关，仅针对光度保真度进行优化，而未考虑其在下游任务中的实际效用。为此，我们提出了PolarAPP，这是首个联合优化去马赛克处理及其下游任务的框架。PolarAPP引入了一种特征对齐机制，通过元学习在语义层面将去马赛克网络与下游任务网络的表征对齐，从而引导重建过程具备任务感知能力。该框架进一步采用等效成像约束进行去马赛克训练，使其能够直接回归到具有物理意义的输出，而无需依赖重组数据。最后，通过任务精调阶段，利用稳定的去马赛克前端对任务网络进行微调，以进一步提升精度。大量实验结果表明，PolarAPP在去马赛克质量与下游任务性能上均优于现有方法。代码将在论文录用后公开。

Abstract

Polarimetric imaging enables advanced vision applications such as normal estimation and de-reflection by capturing unique surface-material interactions. However, existing applications (alternatively called downstream tasks) rely on datasets constructed by naively regrouping raw measurements from division-of-focal-plane sensors, where pixels of the same polarization angle are extracted and aligned into sparse images without proper demosaicking. This reconstruction strategy results in suboptimal, incomplete targets that limit downstream performance. Moreover, current demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks. To this end, we propose PolarAPP, the first framework to jointly optimize demosaicking and its downstream tasks. PolarAPP introduces a feature alignment mechanism that semantically aligns the representations of demosaicking and downstream networks via meta-learning, guiding the reconstruction to be task-aware. It further employs an equivalent imaging constraint for demosaicking training, enabling direct regression to physically meaningful outputs without relying on rearranged data. Finally, a task-refinement stage fine-tunes the task network using the stable demosaicking front-end to further enhance accuracy. Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance. Code will be made available upon acceptance.

Keywords: polarimetric imaging, demosaicking, downstream tasks, feature alignment, meta-learning, equivalent imaging constraint, task-aware reconstruction, joint optimization

206. ❌ Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

Authors: Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

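Scoring rationale: The paper focuses on image generation and studies reinforcement learning (RL) tuning of autoregressive models; its core contribution is a lightweight RL framework that combines instance-level and distribution-level rewards. Although the paper involves RL and model tuning, all keywords explicitly target large language models (LLMs) or large-model techniques, whereas this paper studies autoregressive models for image generation and is unrelated to text LLMs. Keywords such as 'Large Language Models', 'LLM Agents', and 'RLHF' refer specifically to language models, and the paper does not involve any language model or large-model technique, so all keyword relevance scores are 0.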

!!! tip deepseek-chat TL;DR

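The paper proposes a lightweight reinforcement learning framework that combines instance-level and distribution-level rewards to tune autoregressive image generation models, addressing the shortcomings of standard training in sample quality and diversity. Experiments show that the method significantly improves generation quality within a small number of tuning iterations while avoiding mode collapse.

Abstract (Chinese translation)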

自回归（AR）模型在图像生成中表现出色，但其标准的极大似然估计训练方法缺乏对样本质量和多样性的直接优化。尽管强化学习（RL）已被用于对齐扩散模型，但这些方法通常存在输出多样性崩溃的问题。类似地，当前针对AR模型的强化学习方法严格依赖于实例级奖励，往往以牺牲分布覆盖度为代价来换取质量提升。为解决这些局限性，我们提出了一种轻量级强化学习框架，将基于令牌的自回归合成建模为马尔可夫决策过程，并通过组相对策略优化（GRPO）进行优化。我们的核心贡献是引入了一种新颖的分布级留一FID（LOO-FID）奖励：通过利用特征矩的指数移动平均，它明确鼓励样本多样性，并在策略更新过程中防止模式崩溃。我们将此奖励与复合实例级奖励（CLIP和HPSv2）相结合，以确保严格的语义和感知保真度，并通过自适应熵正则化项来稳定多目标学习。在LlamaGen和VQGAN架构上进行的大量实验表明，仅需数百次调优迭代，模型在标准质量和多样性指标上均取得了显著提升。结果还显示，即使在没有无分类器引导的情况下，该模型也能通过更新生成具有竞争力的样本，从而绕过其两倍的推理成本。

Abstract

Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, thereby bypassing its 2x inference cost.
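The abstract names Group Relative Policy Optimization (GRPO) without spelling out the update. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: rewards for a group of samples are centered on the group mean and scaled by the group standard deviation. The function name, the choice of the population standard deviation, the `eps` constant, and the reward values are assumptions for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical combined rewards for four images sampled in the same group.
group_rewards = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])

# If every sample in the group receives the same reward, no sample is preferred.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

Because the advantages are relative to the group, only the ranking and spread of rewards within a group drive the policy update, which removes the need for a separate learned value baseline.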

Keywords: Autoregressive Models, Image Generation, Reinforcement Learning, Policy Optimization, Distribution-Level Reward, Instance-Level Reward, Sample Diversity, Mode Collapse

207. ❌ MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

Authors: Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muhammad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, Sajid Javed Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

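Scoring rationale: The paper proposes a multimodal large language model (MLLM) for whole slide image (WSI) understanding. Its core is a hierarchical architecture that aligns visual features with pathology language at four scales and uses an instruction-tuned LLM for open-ended reasoning. It is therefore highly relevant to 'Large Language Models', 'Instruction Tuning', and 'AI for Science' (10 points each). The model emphasizes interpretability and evidence-grounded reasoning, making it highly relevant to 'Mechanistic Interpretability' (10 points). The paper also touches on training (pre-training, fine-tuning) and reasoning (multi-step reasoning, in-depth reasoning, factuality), which are partly relevant (5 points each). Other keywords, such as MoE, quantization, and RAG, are not mentioned in the abstract or are not central, so each scores 0.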

!!! tip deepseek-chat TL;DR

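The study addresses the problem that existing computational pathology MLLMs compress an entire whole slide image into a single embedding and ignore how pathologists synthesize evidence across scales. It proposes MLLM-HWSI, a model with hierarchical multi-scale alignment, which achieves new state-of-the-art results on 13 WSI-level benchmarks.

Abstract (Chinese translation)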

全切片图像（Whole Slide Images, WSIs）具有层次化结构，其诊断信息源自细胞形态、区域组织结构和全局上下文。现有的计算病理学（Computational Pathology, CPath）多模态大语言模型（Multimodal Large Language Models, MLLMs）通常将整个WSI压缩为单一嵌入表示，这阻碍了细粒度定位，并忽略了病理学家如何综合不同尺度证据的过程。我们提出了MLLM-HWSI，一种层次化的WSI级多模态大语言模型，它在四个不同尺度上将视觉特征与病理学语言对齐：细胞如词、图像块如短语、区域如句子、WSI如段落，以支持可解释的基于证据的推理。MLLM-HWSI通过尺度特定的投影器将每个WSI分解为多尺度嵌入，并联合优化（i）层次化对比学习目标和（ii）跨尺度一致性损失，从而保持从细胞到WSI的语义连贯性。我们计算诊断相关的图像块，并使用轻量级的细胞-细胞注意力融合（Cell-Cell Attention Fusion, CCAF）Transformer将分割后的细胞嵌入聚合为每个图像块的紧凑细胞标记。投影后的多尺度标记与文本标记融合，并输入至经过指令调优的大语言模型，以执行开放式推理、视觉问答、报告生成和描述生成任务。通过三阶段训练，MLLM-HWSI在六项CPath任务的13个WSI级基准测试中取得了新的最优性能。通过将语言与多尺度视觉证据对齐，MLLM-HWSI提供了准确、可解释的输出，这些输出反映了诊断工作流程，并推进了对WSI的整体理解。代码发布于：https://github.com/BasitAlawode/HWSI-MLLM。

Abstract

Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per patch using a lightweight Cell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: https://github.com/BasitAlawode/HWSI-MLLM.

Keywords: Multimodal Large Language Model, Whole Slide Image, Hierarchical Understanding, Computational Pathology, Instruction Tuning, Interpretable Reasoning, Multi-scale Alignment, Visual-Language Alignment

208. ❌ Generative Event Pretraining with Foundation Model Alignment

Authors: Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

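Scoring rationale: The paper proposes GEP, a framework that transfers knowledge from visual foundation models (VFMs) to event-camera data. Its core involves foundation-model alignment (keyword 1, relevance 8) and pretraining (keyword 5, relevance 10), with the alignment achieved through a regression-contrastive objective (keyword 7, relevance 10). The work is treated as an application of AI for Science to visual perception (keyword 27, relevance 8). Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the scarcity of labeled event-camera data, which makes event-based visual foundation models hard to train. Its GEP framework aligns the semantic knowledge of image foundation models to event data while learning event-specific temporal dynamics, and it outperforms existing methods on multiple downstream tasks.

Abstract translation (Chinese)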

事件相机凭借其微秒级延迟和高动态范围，在快速运动与挑战性光照条件下仍能提供鲁棒的视觉信号。然而，其独特的传感特性与有限的标注数据使得训练基于事件的视觉基础模型（VFMs）面临挑战，而此类模型对于学习可跨任务迁移的视觉特征至关重要。为解决这一问题，我们提出GEP（生成式事件预训练），这是一个两阶段框架，能够将从互联网规模图像数据集中学到的语义知识迁移至事件数据，同时学习事件特有的时序动态。首先，通过联合回归-对比目标将事件编码器与一个冻结的视觉基础模型对齐，使事件特征植根于图像语义。其次，在混合的事件-图像序列上对Transformer主干网络进行自回归预训练，以捕捉事件独有的时序结构。我们的方法在多种下游任务（包括物体识别、分割和深度估计）上超越了现有的事件预训练方法。视觉基础模型引导的对齐与生成式序列建模相结合，共同产生了一个语义丰富、具备时序感知能力的事件模型，该模型能够稳健地跨领域泛化。

Abstract

Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.

Keywords: Event cameras, Visual foundation models, Pretraining, Alignment, Generative sequence modeling, Temporal dynamics, Domain transfer, Object recognition
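
The first GEP stage aligns an event encoder to a frozen VFM through a joint regression-contrastive objective. The abstract does not give the exact form of that objective, so the following is only a minimal sketch under assumed choices: a mean-squared-error regression term plus an InfoNCE-style contrastive term over a batch of paired event/image features. The names `joint_alignment_loss`, `temperature`, and `contrastive_weight` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def joint_alignment_loss(event_feats, image_feats, temperature=0.07, contrastive_weight=1.0):
    """Assumed joint objective: MSE regression + InfoNCE contrastive term.

    event_feats: (batch, dim) features from the trainable event encoder.
    image_feats: (batch, dim) features from the frozen image model; row i is
                 the positive pair for event row i.
    Returns (total, regression, contrastive).
    """
    # Regression term: pull each event feature toward its paired image feature.
    regression = float(np.mean((event_feats - image_feats) ** 2))

    # Contrastive term: each event row should match its own image row
    # better than any other image row in the batch.
    e = l2_normalize(event_feats)
    v = l2_normalize(image_feats)
    logits = e @ v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = float(-np.mean(np.diag(log_probs)))

    return regression + contrastive_weight * contrastive, regression, contrastive


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
image_batch = rng.normal(size=(4, 8))
event_batch = image_batch + 0.1 * rng.normal(size=(4, 8))
total, reg, con = joint_alignment_loss(event_batch, image_batch)
```

Under this sketch, correctly paired event/image features yield a lower loss than mismatched pairs, which is the property an alignment stage of this kind relies on.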

209. ❌ Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment

Authors: Guoyang Zhao, Weiqing Qi, Kai Zhang, Chenguang Zhang, Zeying Gong, Zhihai Bi, Kai Chen, Benshan Ma, Ming Liu, Jun Ma Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

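Scoring rationale: The paper focuses on traffic sign recognition (TSR) for autonomous driving. It presents TS-1M, a large-scale dataset, together with a diagnostic benchmark, and compares classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Although VLMs are involved, the paper is mainly about visual perception and dataset construction, not about large language models (LLMs) or new deep learning techniques. All keywords target LLMs, deep learning techniques, or AI for Science (e.g., bioinformatics, cheminformatics), whereas this paper is a computer vision application in autonomous driving and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper presents TS-1M, a large-scale, globally diverse traffic sign dataset, and a diagnostic benchmark for systematically evaluating learning paradigms under cross-region variation, long-tailed categories, and semantic ambiguity. Real-scene experiments validate its practical relevance.

Abstract translation (Chinese)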

交通标志识别（Traffic Sign Recognition, TSR）是自动驾驶的核心感知能力，其对跨区域差异、长尾类别以及语义模糊性的鲁棒性对于实际场景的可靠部署至关重要。尽管识别准确率已取得稳步进展，但现有的交通标志数据集与基准测试在诊断不同建模范式如何应对这些实际挑战方面提供的洞察力有限。本文提出了TS-1M，一个大规模且全球多样化的交通标志数据集，包含涵盖454个标准化类别的一百余万张真实世界图像，并配套一个旨在分析模型能力边界的诊断性基准。除了标准的训练-测试评估外，我们还提供了一系列面向挑战的设定，包括跨区域识别、稀有类别识别、低清晰度鲁棒性以及语义文本理解，从而能够对现代TSR模型进行系统化、细粒度的评估。利用TS-1M，我们对三种代表性学习范式进行了统一基准测试：经典监督模型、自监督预训练模型以及多模态视觉-语言模型（Vision-Language Models, VLMs）。我们的分析揭示了一致的范式依赖性行为，表明语义对齐是跨区域泛化和稀有类别识别的关键因素，而纯视觉模型仍对外观变化和数据不平衡敏感。最后，我们通过真实场景自动驾驶实验验证了TS-1M的实际相关性，其中交通标志识别与语义推理及空间定位相结合，以支持地图层级的决策约束。总体而言，TS-1M为TSR建立了一个参考级的诊断基准，并为鲁棒且具有语义感知的交通标志感知提供了原理性见解。项目页面：https://guoyangzhao.github.io/projects/ts1m。

Abstract

Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: https://guoyangzhao.github.io/projects/ts1m.

Keywords: Traffic Sign Recognition, Autonomous Driving, Large-scale Dataset, Cross-region Generalization, Multimodal Vision-Language Models, Semantic Alignment, Real-world Deployment, Diagnostic Benchmark

210. ❌ Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps

Authors: Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin, Minsu Cho Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

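Scoring rationale: Cog3DMap focuses on spatial reasoning in multimodal large language models (MLLMs), and its core aim is to overcome the limits of MLLMs in 3D spatial understanding. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), since MLLMs extend LLMs. Because the paper involves spatial reasoning, it is also related to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (8 each): building a 3D map and reasoning over it requires multi-step, in-depth thinking. Other keywords, such as MoE, SLMs, training techniques, optimization methods, agent systems, and compression and acceleration, are not directly addressed and receive 0. The paper does not belong to a specific scientific field such as bioinformatics, so "AI for Science" also receives 0.

!!! tip deepseek-chat TL;DR

The paper targets the lack of explicit geometric grounding that limits multimodal large language models in precise spatial understanding from multi-view images. It proposes Cog3DMap, a framework that recurrently builds an explicit 3D memory to strengthen spatial reasoning, and reports state-of-the-art performance on several spatial reasoning benchmarks.

Abstract translation (Chinese)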

从多视角图像中实现精确的空间理解，对于多模态大语言模型（MLLMs）而言，仍然是一个根本性挑战，因为其视觉表征主要基于语义，缺乏显式的几何基础。现有方法虽然通过视觉几何模型提供的几何线索来增强视觉标记（visual tokens），但其MLLM仍需从这些增强后的标记中隐式推断场景的底层三维结构，这限制了其空间推理能力。为解决这一问题，我们提出了Cog3DMap框架，该框架能够从多视角图像中循环构建一个显式的三维记忆（3D memory），其中每个标记都基于三维空间，并同时具备语义和几何信息。通过将这些标记输入MLLM，我们的框架能够直接在具有空间结构的三维地图上进行推理，从而在多种空间推理基准测试中取得了最先进的性能。代码将公开提供。

Abstract

Precise spatial understanding from multi-view images remains a fundamental challenge for Multimodal Large Language Models (MLLMs), as their visual representations are predominantly semantic and lack explicit geometric grounding. While existing approaches augment visual tokens with geometric cues from visual geometry models, their MLLM is still required to implicitly infer the underlying 3D structure of the scene from these augmented tokens, limiting its spatial reasoning capability. To address this issue, we introduce Cog3DMap, a framework that recurrently constructs an explicit 3D memory from multi-view images, where each token is grounded in 3D space and possesses both semantic and geometric information. By feeding these tokens into the MLLM, our framework enables direct reasoning over a spatially structured 3D map, achieving state-of-the-art performance on various spatial reasoning benchmarks. Code will be made publicly available.

Keywords: Multimodal Large Language Models, 3D Cognitive Maps, Spatial Reasoning, Multi-view Images, Geometric Grounding, Visual Representations, State-of-the-art Performance, Explicit 3D Memory
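
The abstract describes a 3D memory that is built recurrently from multi-view images, with each token grounded in 3D space and carrying both semantic and geometric information. It does not say how the memory is stored or updated, so the sketch below is only one plausible reading: features are placed at world coordinates, bucketed into voxels, and averaged as new views arrive. The class `Memory3D`, the helper `to_world`, and the `voxel_size` parameter are all hypothetical names introduced here for illustration.

```python
import numpy as np


def to_world(points_cam, rotation, translation):
    """Map (n, 3) camera-frame points to the world frame using a known pose."""
    return points_cam @ rotation.T + translation


class Memory3D:
    """Toy voxel-keyed memory: one token per occupied voxel (assumed design)."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.cells = {}  # voxel index -> [sum of positions, sum of features, count]

    def update(self, points_world, features):
        """Fold one view into the memory; call once per incoming view."""
        keys = np.floor(points_world / self.voxel_size).astype(int)
        for key, xyz, feat in zip(map(tuple, keys), points_world, features):
            if key not in self.cells:
                self.cells[key] = [np.zeros(3), np.zeros(feat.shape[0]), 0]
            cell = self.cells[key]
            cell[0] += xyz
            cell[1] += feat
            cell[2] += 1

    def tokens(self):
        """Return (num_voxels, 3 + dim) tokens: mean position then mean feature."""
        keys = sorted(self.cells)
        xyz = np.array([self.cells[k][0] / self.cells[k][2] for k in keys])
        feat = np.array([self.cells[k][1] / self.cells[k][2] for k in keys])
        return np.concatenate([xyz, feat], axis=1)


# Two views: the second re-observes the first voxel, so the memory keeps 2 tokens.
memory = Memory3D(voxel_size=0.5)
memory.update(np.array([[0.1, 0.1, 0.1], [1.2, 0.1, 0.1]]), np.array([[1.0, 0.0], [0.0, 1.0]]))
memory.update(np.array([[0.2, 0.2, 0.2]]), np.array([[3.0, 0.0]]))
map_tokens = memory.tokens()
```

Each token concatenates a 3D position with a feature vector, so a downstream model would receive geometry and semantics together instead of having to infer the 3D structure from per-view tokens.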

211. ❌ Zero-Shot Personalization of Objects via Textual Inversion

Authors: Aniket Roy, Maitreya Suin, Rama Chellappa Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

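Scoring rationale: The paper focuses on personalization in text-to-image diffusion models, specifically zero-shot object personalization via textual inversion embeddings. All scoring keywords target large language models (LLMs) and related techniques, whereas this paper studies diffusion models (a class of generative model distinct from LLMs), so it is unrelated to every keyword. The paper does not involve LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, model compression, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a zero-shot object personalization framework based on textual inversion embeddings: it predicts object-specific embeddings and integrates them into a diffusion model, enabling fast, training-free image customization.

Abstract translation (Chinese)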

文本到图像扩散模型的最新进展显著提升了图像定制化的质量，能够合成高度逼真的图像。尽管取得了这些进展，实现快速高效的个人化仍然是一个关键挑战，特别是在现实世界应用中。现有方法主要通过向扩散模型注入特定身份嵌入来加速人物主体的定制，但这些策略无法很好地泛化到任意物体类别，限制了其应用范围。为应对这一局限，我们提出了一种新颖框架，该框架采用一个学习网络来预测物体特定的文本反转嵌入，随后将这些嵌入整合到扩散模型的UNet时间步中，以实现文本条件定制。这一设计使得在单次前向传播中即可对广泛物体进行快速、零样本的个人化，兼具灵活性与可扩展性。在多种任务和设置下的大量实验证明了我们方法的有效性，凸显了其在支持快速、通用和包容性图像定制方面的潜力。据我们所知，本研究首次尝试在扩散模型中实现此类通用、无需训练的个人化，为个性化图像生成的未来研究铺平了道路。

Abstract

Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.

Keywords: text-to-image diffusion models, personalization, textual inversion, zero-shot, object customization, training-free, image generation, UNet

212. ❌ VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought

Authors: Xuanyu Zhang, Weiqi Li, Qunliang Xing, Jingfen Xie, Bin Chen, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

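Scoring rationale: VQ-Jarvis studies a video restoration agent. Its core innovation is combining retrieval-augmented generation (RAG) with an agentic workflow to optimize the restoration process. It is therefore highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10), because the paper explicitly retrieves from a RAG library, and to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10), because it designs an agent system for dynamic decision making and trajectory search. Other keywords, such as those on large models, training methods, and inference optimization, are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes VQ-Jarvis, a retrieval-augmented video restoration agent. Built on a large-scale video comparison dataset and a hierarchical scheduling strategy, it outperforms existing methods on videos with complex degradations.

Abstract translation (Chinese)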

现实场景中的视频复原任务常受异构退化问题的挑战，静态架构与固定推理流程往往难以泛化。近期基于智能体的方法虽能实现动态决策，但现有视频复原智能体仍受限于感知能力不足与搜索策略低效。我们提出VQ-Jarvis——一个具备更敏锐视觉与更快思维速度、基于检索增强的一体化智能视频复原智能体。VQ-Jarvis旨在精准感知退化类型及配对复原结果间的细微差异，同时高效探索最优复原轨迹。为实现敏锐视觉，我们构建了首个大规模视频配对增强数据集VSR-Compare，包含2万组对比对，涵盖7种退化类型、11种增强算子及多样内容领域。基于该数据集，我们训练了多算子评判模型与退化感知模型以指导智能体决策。为实现快速思维，我们提出分层算子调度策略以适应视频难度：对于简单案例，通过检索增强生成（RAG）库一步检索最优复原轨迹；对于复杂案例，则执行逐步贪婪搜索以平衡效率与精度。大量实验表明，VQ-Jarvis在处理复杂退化视频时持续优于现有方法。

Abstract

Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.

Keywords: video restoration, retrieval-augmented generation, agent, VSR-Compare dataset, hierarchical operator scheduling, degradation perception, restoration trajectories, paired enhancement
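
The hierarchical operator scheduling in the abstract has two branches: easy videos get a restoration trajectory retrieved in one step from a library, and harder videos go through a step-by-step greedy search. A minimal sketch of that control flow follows. Everything concrete in it is an assumption made for illustration: a video is modeled as a dictionary of degradation severities, the judge is a scalar scoring function (the paper trains a judge over paired results), and `schedule_restoration`, `easy_threshold`, and `max_steps` are hypothetical names.

```python
from typing import Callable, Dict, FrozenSet, List

Video = Dict[str, float]  # toy stand-in: degradation name -> severity in [0, 1]
Operator = Callable[[Video], Video]


def schedule_restoration(
    video: Video,
    difficulty: float,
    library: Dict[FrozenSet[str], List[str]],
    operators: Dict[str, Operator],
    judge: Callable[[Video], float],
    easy_threshold: float = 0.5,
    max_steps: int = 4,
) -> List[str]:
    """Return the ordered operator names to apply to `video`."""
    degradations = frozenset(name for name, level in video.items() if level > 0)

    # Easy case: look up a stored trajectory in a single step.
    if difficulty < easy_threshold and degradations in library:
        return list(library[degradations])

    # Hard case: greedy search, adding the operator that most improves the
    # judged quality and stopping once no operator helps.
    trajectory: List[str] = []
    current, best = video, judge(video)
    for _ in range(max_steps):
        scored = [(judge(op(current)), name) for name, op in operators.items()]
        score, name = max(scored)
        if score <= best:
            break
        current, best = operators[name](current), score
        trajectory.append(name)
    return trajectory


def remove(kind: str) -> Operator:
    """Toy operator that fully removes one degradation type."""
    return lambda video: {**video, kind: 0.0}


operators = {"denoise": remove("noise"), "deblur": remove("blur")}
judge = lambda video: -sum(video.values())  # higher is better
library = {frozenset({"noise"}): ["denoise"]}

easy_plan = schedule_restoration({"noise": 0.3, "blur": 0.0}, 0.2, library, operators, judge)
hard_plan = schedule_restoration({"noise": 0.8, "blur": 0.5}, 0.9, library, operators, judge)
```

The greedy branch costs one judge call per operator per step, which is why a retrieval shortcut for easy inputs can save work. How the paper measures difficulty and keys its library is not stated in the abstract.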

213. ❌ VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

Authors: Jintao Cheng, Haozhe Wang, Weibin Li, Gang Wang, Yipu Zhang, Xiaoyu Tang, Jin Wu, Xieyuanli Chen, Yunhui Liu, Wei Zhang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

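Scoring rationale: The paper studies visual token pruning for Vision-Language-Action (VLA) models, an application of large models to robotics and embodied intelligence, so it has some relevance to "Large Language Models" and "LLM Agents" (5). Its core contribution is a training-free inference acceleration method that cuts compute cost by pruning, which is highly relevant to "Quantization/Model Compression" and "Speculative Decoding/Inference Acceleration" (8). The method uses semantic-motion alignment, which is conceptually related to "Alignment" (5). Other keywords, such as MoE, SFT, RAG, and CoT, are not covered (0).

!!! tip deepseek-chat TL;DR

The paper proposes VLA-IAP, a training-free visual token pruning method that uses interaction alignment to improve the inference efficiency of Vision-Language-Action models, reaching up to 1.54× speedup while maintaining performance.

Abstract translation (Chinese)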

视觉-语言-动作（Vision-Language-Action，VLA）模型已快速推动了具身智能的发展，使机器人能够执行复杂的指令驱动任务。然而，随着模型容量和视觉上下文长度的增长，VLA系统的推理成本成为在资源受限平台上实际部署的主要瓶颈。现有的视觉令牌剪枝方法主要依赖于语义显著性或简单的时间线索，忽视了VLA任务的一个基本特性：持续的物理交互。因此，当前方法往往会剪除视觉上稀疏但在结构上对操作至关重要的区域，导致任务早期阶段行为不稳定。为克服这一问题，我们提出转向一种明确的“交互优先”范式。我们提出的无需训练的方法VLA-IAP（交互对齐剪枝），引入了一种几何先验机制来保留结构锚点，以及一种基于语义-运动对齐动态调整剪枝强度的调度策略。这使得剪枝过程能够从保守过渡到激进，确保在早期不确定性阶段的鲁棒性，并在交互锁定后实现高效性。大量实验表明，VLA-IAP在LIBERO基准测试中实现了97.8%的成功率，并带来1.25倍的加速，同时最高可达1.54倍的加速，且性能与未剪枝的骨干模型相当。此外，该方法在多种模型架构、三个不同的仿真环境以及一个真实机器人平台上均表现出优越且一致的性能，验证了其强大的泛化能力和实际适用性。我们的项目网站是：https://chengjt1999.github.io/VLA-IAP.github.io/。

Abstract

Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/.

Keywords: Vision-Language-Action Models, Visual Token Pruning, Interaction Alignment, Training-Free Method, Inference Acceleration, Embodied Intelligence, Robot Task Execution, Resource-Constrained Deployment
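
The abstract combines two ideas: structural anchors that are always preserved, and a pruning intensity that moves from conservative to aggressive as semantic-motion alignment increases. The sketch below illustrates that combination under stated assumptions. The linear schedule, the 0.9 and 0.4 keep ratios, and the names `keep_ratio` and `prune_tokens` are hypothetical, and the saliency scores, anchor mask, and alignment value are taken as given inputs because the abstract does not say how they are computed.

```python
import numpy as np


def keep_ratio(alignment, conservative=0.9, aggressive=0.4):
    """Fraction of tokens to keep: high when alignment is low, lower once it is high."""
    a = float(np.clip(alignment, 0.0, 1.0))
    return conservative + a * (aggressive - conservative)


def prune_tokens(saliency, anchor_mask, alignment, conservative=0.9, aggressive=0.4):
    """Return sorted indices of the visual tokens to keep.

    saliency:    (n,) per-token importance scores.
    anchor_mask: (n,) boolean, True for structural anchors that must survive.
    alignment:   scalar in [0, 1], assumed semantic-motion alignment signal.
    """
    n = saliency.shape[0]
    budget = max(1, int(round(keep_ratio(alignment, conservative, aggressive) * n)))

    keep = anchor_mask.copy()
    remaining = budget - int(keep.sum())
    if remaining > 0:
        # Fill the rest of the budget with the most salient non-anchor tokens.
        scores = np.where(keep, -np.inf, saliency)
        keep[np.argsort(-scores)[:remaining]] = True
    return np.flatnonzero(keep)


saliency = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05])
anchors = np.zeros(10, dtype=bool)
anchors[9] = True  # visually weak but structurally critical token

early = prune_tokens(saliency, anchors, alignment=0.0)  # conservative phase
late = prune_tokens(saliency, anchors, alignment=1.0)   # aggressive phase
```

In this toy setup the low-saliency anchor survives in both phases, which a purely saliency-ranked pruner would not guarantee. If the anchors alone exceed the budget, this sketch keeps all anchors and prunes everything else.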

214. ❌ WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

Authors: Manuel-Andreas Schneider, Angela Dai Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22972v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

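Scoring rationale: "WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion" focuses on 3D scene generation and proposes a mesh-conditioned image diffusion method for generating navigable multi-room 3D scenes. Its core techniques involve computer vision, 3D geometric modeling, and image synthesis, but not large language models (LLMs), new deep learning techniques, or applications of AI in science. All scoring keywords relate to LLMs, deep learning techniques, or specific AI for Science applications, whereas this work belongs purely to 3D computer vision and graphics and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of keeping scenes and objects consistent in large-scale 3D scene generation. It proposes a mesh-conditioned image diffusion method that first builds a geometric mesh scaffold and then synthesizes appearance, yielding scalable, highly consistent multi-room 3D scenes.

Abstract translation (Chinese)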

图像与视频合成领域的最新进展启发了其在三维场景生成中的应用。然而，我们观察到，由于缺乏显式几何结构，文本到图像及文本到视频的方法在超出有限环境尺度后难以保持场景与物体层面的一致性。为此，我们提出一种几何优先的方法，将大规模三维场景合成这一复杂问题解耦为两个部分：以网格骨架表示的结构构成，以及基于该网格骨架、利用强大图像合成模型实现的真实感外观合成。根据输入的文本描述，我们首先构建捕捉环境几何结构（墙壁、地面等）的网格，随后借助图像合成、分割与物体重建技术，在网格结构中以逼真布局填充物体。该网格骨架随后被渲染以作为图像合成的条件，为一致的外观生成提供结构支撑。这种方法能够生成可扩展、任意尺寸且具有高度物体丰富性与多样性的三维场景，同时兼顾稳健的三维一致性与照片级真实细节。我们相信，这标志着向生成真正环境尺度、沉浸式三维世界迈出了重要一步。

Abstract

Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment’s geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

Keywords: 3D scene generation, mesh scaffold, image diffusion, geometry-first approach, object consistency, multi-room scenes, photorealistic detail, scalable synthesis

215. ❌ FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning

Authors: Jingchen Ni, Quan Zhang, Dan Jiang, Keyu Lv, Ke Zhang, Chun Yuan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

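Scoring rationale: The paper "FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning" addresses camouflaged object detection (COD) in computer vision and proposes a weakly supervised method based on frequency-aware and contrastive learning. Although it mentions the Segment Anything Model (SAM), SAM serves only as the base model; the paper does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or applications of large models across domains. All scoring keywords concern LLMs, deep learning principles, or AI for Science, whereas this work is a conventional computer vision task and is entirely unrelated to them.

!!! tip deepseek-chat TL;DR

The paper proposes FCL-COD, a weakly supervised camouflaged object detection framework based on frequency-aware and contrastive learning. By introducing frequency-aware low-rank adaptation, gradient-aware contrastive learning, and multi-scale frequency-aware representation learning, it surpasses existing weakly supervised and fully supervised methods on three benchmarks.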

Abstract

Existing camouflage object detection (COD) methods typically rely on fully-supervised learning guided by mask annotations. However, obtaining mask annotations is time-consuming and labor-intensive. Compared to fully-supervised methods, existing weakly-supervised COD methods exhibit significantly poorer performance. Even for the Segment Anything Model (SAM), there are still challenges in handling weakly-supervised camouflage object detection (WSCOD), such as: a. non-camouflage target responses, b. local responses, c. extreme responses, and d. lack of refined boundary awareness, which leads to unsatisfactory results in camouflage scenes. To alleviate these issues, we propose a frequency-aware and contrastive learning-based WSCOD framework in this paper, named FCL-COD. To mitigate the problem of non-camouflaged object responses, we propose the Frequency-aware Low-rank Adaptation (FoRA) method, which incorporates frequency-aware camouflage scene knowledge into SAM. To overcome the challenges of local and extreme responses, we introduce a gradient-aware contrastive learning approach that effectively delineates precise foreground-background boundaries. Additionally, to address the lack of refined boundary perception, we present a multi-scale frequency-aware representation learning strategy that facilitates the modeling of more refined boundaries. We validate the effectiveness of our approach through extensive empirical experiments on three widely recognized COD benchmarks. The results confirm that our method surpasses both state-of-the-art weakly supervised and even fully supervised techniques.

Keywords: Camouflaged Object Detection, Weakly Supervised Learning, Frequency-aware Learning, Contrastive Learning, Segment Anything Model, Low-rank Adaptation, Multi-scale Representation, Foreground-Background Boundary

216. ❌ Few-Shot Generative Model Adaption via Identity Injection and Preservation

Authors: Yeqi He, Liang Li, Jiehua Zhang, Yaoqi Sun, Xichun Sheng, Zhidong Zhao, Chenggang Yan Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
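
The arithmetic behind the table above can be sketched as follows. The report's configuration states only the passing-score formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15. The per-keyword rule `score = weight × relevance` is an assumption inferred from the rows shown (a relevance of 8.0 at weight 0.0 yields 0.0), and all function names are hypothetical.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # cap stated in the report's configuration


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: score = weight x relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows as listed in the table above: 26 keywords at relevance 0.0 and one
# ("Pre-training OR Continual Pre-training OR Domain Adaptation") at 8.0,
# all with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 8.0)]
threshold = passing_score(27.0)  # 27 keywords, total weight 27.0 per the configuration
score = total_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Run as written, the sketch prints `0.0 / 26.6 fail`, which agrees with the score line for this entry. Under the assumed rule, a weight of 0.0 zeroes out every keyword regardless of its relevance rating.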

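Scoring rationale: The paper studies few-shot adaptation of generative models, which belongs to generative-model research in computer vision. Its core contribution is the I²P method for addressing the forgetting of source-domain identity knowledge, which mainly involves domain adaptation techniques. It is therefore strongly related only to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation" (rated 8), because the paper explicitly mentions "adapt a large pretrained generative model upon a target domain", which falls under domain adaptation. The other keywords target specific techniques such as large language models, reasoning, alignment, and compression, which are unrelated to this paper's generative image model research, so all are rated 0.

!!! tip deepseek-chat TL;DR

To address the forgetting of source-domain identity knowledge in few-shot generative model adaptation, the paper proposes an identity injection and preservation method that significantly outperforms existing methods across multiple datasets and metrics.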

Abstract

Training generative models with limited data presents severe challenges of mode collapse. A common approach is to adapt a large pretrained generative model upon a target domain with very few samples (fewer than 10), known as few-shot generative model adaptation. However, existing methods often suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain. To address this, we propose Identity Injection and Preservation (I$^2$P), which leverages identity injection and consistency alignment to preserve the source identity knowledge. Specifically, we first introduce an identity injection module that integrates source domain identity knowledge into the target domain’s latent space, ensuring the generated images retain key identity knowledge of the source domain. Second, we design an identity substitution module, which includes a style-content decoupler and a reconstruction modulator, to further enhance source domain identity preservation. We enforce identity consistency constraints by aligning features from identity substitution, thereby preserving identity knowledge. Both quantitative and qualitative experiments show that our method achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics.

Keywords: few-shot adaptation, generative models, domain adaptation, identity preservation, mode collapse, image generation, pretrained models, latent space

217. ❌ Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Authors: Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

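Scoring rationale: The paper focuses on efficient methods for video-language pretraining. Its core innovation, the Cluster-Wise Spatio-Temporal Masking strategy, is a pretraining technique, so it is highly related to "Pre-training OR Continual Pre-training OR Domain Adaptation" (rated 10). The other keywords mainly concern large language models, reasoning, alignment, compression, and scientific applications, none of which the paper addresses directly, so all are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes ClusterSTM, a cluster-wise spatio-temporal masking strategy that addresses visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations in video-language pretraining, achieving state-of-the-art performance among efficient video-language models on multiple benchmarks.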

Abstract

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.

Keywords: video-language pretraining, masked visual modeling, spatio-temporal masking, efficient pretraining, multimodal tasks, video-text retrieval, video question answering, video captioning
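
The cluster-wise masking step described in the ClusterSTM abstract (keep, within each cluster, the token with the highest temporal density) can be illustrated with a minimal sketch. The abstract does not say how clusters are formed or how temporal density is measured, so both are taken here as precomputed inputs; the function name, the input lists, and the example values are all hypothetical and are not the authors' implementation.

```python
def cluster_wise_keep(cluster_ids: list[int], temporal_density: list[float]) -> list[int]:
    """Return indices of tokens to keep: one per cluster, chosen as the token
    with the highest temporal-density score (first one wins on ties)."""
    if len(cluster_ids) != len(temporal_density):
        raise ValueError("cluster_ids and temporal_density must have equal length")
    best: dict[int, int] = {}  # cluster id -> index of the best token seen so far
    for idx, (cid, dens) in enumerate(zip(cluster_ids, temporal_density)):
        if cid not in best or dens > temporal_density[best[cid]]:
            best[cid] = idx
    return sorted(best.values())


# Hypothetical frame with 8 visual tokens grouped into 3 clusters.
# ASSUMPTION: cluster ids and density scores come from an upstream step
# that the abstract does not specify.
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 0]
temporal_density = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3, 0.3, 0.5]

kept = cluster_wise_keep(cluster_ids, temporal_density)
mask_ratio = 1 - len(kept) / len(cluster_ids)
print(kept, mask_ratio)
```

In this toy frame the sketch keeps tokens 1, 4, and 5, giving a masking ratio of 0.625. Because one token survives per cluster, every semantic group stays represented however high the ratio goes.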

218. ❌ Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion

Authors: Shuangwu Qian, Xiaochan Yuan, Pengfei Liu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

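Scoring rationale: The paper studies automatic caption generation for Dongba paintings, which lies at the intersection of computer vision and natural language processing rather than innovating on large models or the underlying principles of deep learning. It initializes a Transformer decoder with pretrained BERT weights, giving it some relation to "Pre-training OR Continual Pre-training OR Domain Adaptation" (rated 5), since it applies a pretrained model. Dongba paintings also belong to the cultural heritage domain, which has some relation to the scientific applications covered by "AI for Science OR Bioinformatics OR Cheminformatics" (rated 5), although the work is not core bioinformatics or cheminformatics. Other keywords such as LLMs, MoE, SFT, and RLHF are not involved and are therefore rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes PVGF-DPC, an encoder-decoder framework that combines prompt learning and semantic fusion to tackle the difficulty of automatically captioning Dongba paintings caused by their cultural specificity, and builds a dedicated dataset of 9,408 augmented images.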

Abstract

Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels (such as deity, ritual pattern, or hell ghost) and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.

Keywords: Dongba paintings, caption generation, prompt learning, semantic fusion, Transformer decoder, BERT initialization, cultural specificity, visual semantic-generation fusion loss
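
The abstract above says the fusion loss jointly optimizes the cross-entropy objectives of the prompt predictor and the caption generator, but it does not give the combination rule. The sketch below assumes the simplest reading, a weighted sum with a hypothetical coefficient `lam`; the function names and the toy probabilities are illustrative only and should not be read as the paper's formulation.

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])


def fusion_loss(
    prompt_probs: list[float],
    prompt_label: int,
    caption_probs: list[list[float]],
    caption_tokens: list[int],
    lam: float = 1.0,
) -> float:
    """ASSUMPTION: loss = mean per-token caption CE + lam * prompt-label CE."""
    prompt_ce = cross_entropy(prompt_probs, prompt_label)
    caption_ce = sum(
        cross_entropy(p, t) for p, t in zip(caption_probs, caption_tokens)
    ) / len(caption_tokens)
    return caption_ce + lam * prompt_ce


# Toy example: 3 candidate prompt labels, a 2-token caption over a 2-word vocabulary.
prompt_probs = [0.7, 0.2, 0.1]              # predicted label distribution
caption_probs = [[0.5, 0.5], [0.25, 0.75]]  # per-step token distributions
loss = fusion_loss(prompt_probs, 0, caption_probs, [0, 1], lam=1.0)
print(round(loss, 4))
```

Setting `lam` to 0 recovers a plain captioning loss, which makes the contribution of the prompt term easy to isolate in an ablation.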

219. ❌ FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification

Authors: Daniel Beckmann, Benjamin Risse Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22939v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

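Scoring rationale: The paper "FixationFormer" focuses on medical image analysis (chest X-ray classification). It proposes a Transformer-based architecture that takes expert gaze trajectories directly as sequence input and integrates them at a finer granularity through cross-attention between image features and the gaze token sequence. The work is entirely unrelated to most keywords (such as LLMs, MoE, SFT, RAG, and CoT), since those concern large language models, training methods, reasoning techniques, or general AI agents, whereas this paper is a domain-specific Transformer application (medical image analysis) that does not involve language models or related techniques. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", rated 8.0 because the paper applies AI in a scientific (medical/bioinformatics) domain; it does not receive full marks because the work is radiological image analysis rather than core bioinformatics or cheminformatics. The other keywords are rated 0.0 for lack of a direct connection. The weighted total is computed as 8.0 (only one keyword scores).

!!! tip deepseek-chat TL;DR

The paper investigates how to integrate expert gaze trajectories directly into a Transformer-based architecture to improve chest X-ray classification; the results show that FixationFormer achieves state-of-the-art classification performance on three public datasets.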

Abstract

Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.

Keywords: FixationFormer, expert gaze trajectories, transformer architecture, chest X-ray classification, medical image analysis, cross-attention, sequence modeling, diagnostic reasoning

220. ❌ When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

Authors: Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang, Zhuolin Yi, Yanzhen Ren, Kai Li Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

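Scoring rationale: The paper studies the performance degradation of audio-visual speech recognition (AVSR) in video conferencing, builds the MLD-VC dataset, and improves model robustness through fine-tuning. The work lies at the intersection of computer vision and speech processing and relates to applied deep learning, but it does not involve large language model (LLM) techniques. The only related keyword is "Post-training OR Supervised Fine-tuning OR SFT", because the paper mentions fine-tuning AVSR models; however, fine-tuning is part of the solution rather than the core contribution, so it is rated 5 (some relation). The other keywords concern large language models, reasoning, alignment, compression, and similar techniques, and are rated 0.

!!! tip deepseek-chat TL;DR

The paper presents the first systematic evaluation of audio-visual speech recognition models in video conferencing and finds that transmission distortions and human hyper-expression cause severe performance degradation. By building the MLD-VC dataset and fine-tuning models on it, it reduces the character error rate (CER) by 17.5% on average.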

Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.

Keywords: Audio-Visual Speech Recognition, Video Conferencing, Performance Degradation, MLD-VC Dataset, Lombard Effect, Fine-tuning, Speech Enhancement, Robustness
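
For readers unfamiliar with the metric in the abstract above, the sketch below computes a character error rate as edit distance divided by reference length, assuming CER carries its usual meaning of character error rate. The abstract does not say whether the 17.5% figure is a relative or an absolute reduction, so `relative_reduction` shows only one possible reading. The transcripts are made-up strings, not data from MLD-VC.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """ASSUMPTION: reduction measured relative to the pre-fine-tuning error rate."""
    return (before - after) / before


# Hypothetical transcripts before and after fine-tuning.
reference = "hello world"
cer_before = cer(reference, "hallo wrld")  # one substitution, one deletion
cer_after = cer(reference, "hello wrld")   # one deletion
print(cer_before, cer_after, relative_reduction(cer_before, cer_after))
```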

221. ❌ Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation

Authors: Zhe Zhang, Jing Li, Wanli Xue, Xu Cheng, Jianhua Zhang, Qinghua Hu, Shengyong Chen Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22908v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

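Scoring rationale: The paper addresses black-box domain adaptation in computer vision and proposes a dual-teacher distillation method with subnetwork rectification. It is unrelated to most keywords, which mainly target large language models (LLMs) and associated techniques (such as fine-tuning, alignment, reasoning, and compression). The only related keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper's core topic is domain adaptation, a subfield of transfer learning that adapts a model from a source domain to a target domain. The paper does not involve large models, AI-for-science applications, or other specific techniques such as MoE, SFT, or RAG.

!!! tip deepseek-chat TL;DR

The paper tackles black-box domain adaptation with a dual-teacher distillation method with subnetwork rectification, which integrates the complementary predictions of the black-box source model and a vision-language model to generate reliable pseudo-labels, and outperforms existing methods on multiple benchmark datasets.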

Abstract

Assuming that neither source data nor the source model is accessible, black-box domain adaptation represents a highly practical yet extremely challenging setting, as transferable information is restricted to the predictions of the black-box source model, which can only be queried using target samples. Existing approaches attempt to extract transferable knowledge through pseudo-label refinement or by leveraging external vision-language models (ViLs), but they often suffer from noisy supervision or insufficient utilization of the semantic priors provided by ViLs, which ultimately hinder adaptation performance. To overcome these limitations, we propose a dual-teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black-box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo-labels for the target domain and introduces a subnetwork-driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo-labels and ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self-training with class-wise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state-of-the-art methods, including those using source data or models.

Keywords: black-box domain adaptation, dual teacher distillation, subnetwork rectification, vision language models, pseudo label refinement, self-training, target domain adaptation, transfer learning
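
The abstract above states that DDSR adaptively integrates the predictions of the black-box source model and the ViL to form pseudo-labels, without giving the weighting rule. As a rough illustration only, the sketch below assumes each teacher is weighted by its own confidence (its maximum class probability). This rule, the function name, and the example probabilities are hypothetical stand-ins, not the paper's method.

```python
def fuse_predictions(
    source_probs: list[float], vil_probs: list[float]
) -> tuple[int, list[float]]:
    """Fuse two teachers' class probabilities into a pseudo-label.

    ASSUMPTION: each teacher's weight is its own max probability, so the more
    confident teacher dominates; the abstract does not specify the actual rule.
    """
    if len(source_probs) != len(vil_probs):
        raise ValueError("both teachers must predict over the same classes")
    w_src = max(source_probs)
    w_vil = max(vil_probs)
    total = w_src + w_vil
    fused = [
        (w_src * s + w_vil * v) / total for s, v in zip(source_probs, vil_probs)
    ]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical target sample: the source model is unsure between classes 0 and 1,
# while the ViL is confident in class 1.
source_probs = [0.5, 0.4, 0.1]
vil_probs = [0.1, 0.8, 0.1]
label, fused = fuse_predictions(source_probs, vil_probs)
print(label, [round(p, 3) for p in fused])
```

Because the weights are normalized, the fused vector is still a probability distribution and can be used as a soft target as well as a hard label.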

222. ❌ SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

Authors: Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, ZHan Xu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

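Scoring rationale: SLARM targets computer-vision tasks: dynamic scene reconstruction, semantic understanding, and real-time streaming inference. Although it uses language-aligned semantic features (distilled from LSeg), its core content involves neither large language models (LLMs) nor innovations in the underlying principles of deep learning. All scoring keywords relate directly to large-model technology, training methods, inference optimization, alignment techniques, agent systems, and similar topics, whereas this paper is purely computer-vision/3D-reconstruction work and is unrelated to those keywords.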

!!! tip deepseek-chat TL;DR

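SLARM is a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. Through higher-order motion modeling and language-aligned representations, it achieves state-of-the-art performance in dynamic estimation, rendering quality, and scene parsing.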

摘要翻译

我们提出SLARM，一种前馈模型，它统一了动态场景重建、语义理解与实时流式推理。SLARM通过高阶运动建模捕捉复杂非均匀运动，仅基于可微分渲染进行训练，无需任何光流监督。此外，SLARM从LSeg中蒸馏语义特征，以获得语言对齐的表征。这一设计支持通过自然语言进行语义查询，且语义与几何的紧密耦合进一步提升了动态重建的准确性与鲁棒性。同时，SLARM采用基于窗口的因果注意力处理图像序列，实现了稳定、低延迟的流式推理，且无需累积内存开销。在此统一框架内，SLARM在动态估计、渲染质量与场景解析方面均达到最先进水平，相较于现有方法，其运动精度提升21%，重建PSNR提高1.6 dB，分割mIoU提升20%。

摘要 (Abstract)

We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.

Keywords: dynamic scene reconstruction, semantic understanding, real-time streaming inference, language-aligned representations, higher-order motion modeling, window-based causal attention, feed-forward model, differentiable rendering
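The abstract credits window-based causal attention for streaming inference whose memory cost does not accumulate. As a rough illustration only, the sketch below builds a frame-level attention mask of that kind. The function name `window_causal_mask`, the frame-level granularity, and the window size of 3 are assumptions made for this example, not details taken from the paper.

```python
def window_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Causal: frame i never sees a later frame (j <= i).
    Windowed: frame i sees at most the `window` most recent frames.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Hypothetical sizes, chosen only to make the pattern visible.
mask = window_causal_mask(num_frames=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row holds at most `window` allowed positions, so the per-step attention cost stays bounded however long the stream runs.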

223. ❌ Group Editing: Edit Multiple Images in One Go

Authors: Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22883v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

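Scoring rationale: The paper addresses consistent multi-image editing in computer vision. It proposes the GroupEditing framework, which uses VGGT to extract geometric correspondences and exploits the temporal-coherence priors of pre-trained video models. Although deep learning is involved, all keywords target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, quantization, and so on) or AI applications in specific scientific fields. The paper mentions no language models, no large-model technical principles, and no concrete AI for Science application, so every keyword receives a relevance of 0.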

!!! tip deepseek-chat TL;DR

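The paper tackles the challenge of applying consistent, unified modifications to a set of related images. It proposes the GroupEditing framework, which combines explicit geometric correspondences with implicit temporal-coherence priors and significantly improves the visual quality, cross-view consistency, and semantic alignment of multi-image editing.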

摘要翻译

本文致力于解决在一组相关图像中进行一致且统一修改的问题。该任务尤其具有挑战性，因为这些图像在姿态、视角和空间布局上可能存在显著差异。实现连贯的编辑需要在图像间建立可靠的对应关系，以便修改能够准确地应用于语义对齐的区域。为此，我们提出了GroupEditing，一个新颖的框架，可在图像组内构建显式和隐式关系。在显式方面，我们使用VGGT提取几何对应关系，它基于视觉特征提供空间对齐。在隐式方面，我们将图像组重新表述为伪视频，并利用预训练视频模型学习到的时间一致性先验来捕捉潜在关系。为了有效融合这两种对应关系，我们通过一种新颖的融合机制，将来自VGGT的显式几何线索注入到视频模型中。为了支持大规模训练，我们构建了GroupEditData，这是一个包含大量图像组的高质量掩码和详细描述的新数据集。此外，为了确保编辑过程中的身份一致性，我们引入了一个对齐增强的RoPE模块，该模块提升了模型在多个图像间保持外观一致性的能力。最后，我们提出了GroupEditBench，一个专门用于评估组级图像编辑效果的基准测试。大量实验表明，GroupEditing在视觉质量、跨视图一致性和语义对齐方面显著优于现有方法。

摘要 (Abstract)

In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model’s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.

Keywords: Group Editing, Multiple Images, Consistent Modifications, Geometric Correspondences, Temporal Coherence, Video Models, Identity Preservation, Benchmark Evaluation

224. ❌ TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

Authors: Chunxiao Li, Lijun Li, Jing Shao Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22882v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	8.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

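Scoring rationale: The paper's core is TreeTeaming, an autonomous red-teaming framework built around an LLM acting as Orchestrator, so it is highly relevant to "Large Language Models" (10). The framework involves autonomous LLM decision-making and strategy exploration, making it highly relevant to "LLM Agents" (10). Strategy exploration uses a tree structure, which relates to "Monte Carlo Tree Search AND LLM" (8). The work touches on safety alignment and strategic reasoning, giving some relevance to "Instruction Tuning/Alignment" (5), "Chain of Thought Reasoning" (5), "System 2 Thinking" (5), and "Tool Use" (5). Other keywords such as MoE, SLMs, Scaling Laws, and RAG are not covered in the paper and receive 0.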
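The nonzero relevance values above do not reach the final score because every weight in this table is listed as 0.0, whereas the report's configuration section lists each keyword at weight 1.0. The sketch below works through the arithmetic. The pass threshold follows the configured formula (5.0 + 0.8 × total weight); the per-keyword score of weight × relevance is an assumption, since the report does not state the exact per-keyword formula.

```python
BASE = 5.0          # constant term of the pass-threshold formula (from the report config)
PER_WEIGHT = 0.8    # multiplier on the total keyword weight (from the report config)


def pass_threshold(weights):
    """Pass threshold = 5.0 + 0.8 * total weight."""
    return BASE + PER_WEIGHT * sum(weights)


def total_score(rows):
    """ASSUMPTION: each keyword contributes weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 configured keywords, each with weight 1.0.
threshold = round(pass_threshold([1.0] * 27), 1)

# (weight, relevance) pairs for the nonzero-relevance rows of the table above.
table_rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0),
              (0.0, 8.0), (0.0, 10.0), (0.0, 5.0)]
score = total_score(table_rows)

print(f"threshold={threshold}, score={score}, passes={score >= threshold}")
```

Under any formula that multiplies by the weight, a weight of 0.0 yields a score of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 result shown for this paper.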

!!! tip deepseek-chat TL;DR

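The paper proposes TreeTeaming, a framework in which an LLM-driven Orchestrator autonomously explores hierarchical strategies to red-team vision-language models for safety vulnerabilities. It achieves state-of-the-art attack success rates on 11 of 12 prominent VLMs, reaching up to 87.60% on GPT-4o, while showing greater strategy diversity and lower toxicity.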

摘要翻译

视觉语言模型（Vision-Language Models, VLMs）的快速发展使其安全漏洞问题日益凸显。然而，现有的红队测试方法本质上受限于固有的线性探索范式，只能在预定义的策略集合内进行优化，难以发现新颖且多样化的攻击手段。为突破这一局限，我们提出了TreeTeaming——一种自动化红队测试框架，将策略探索从静态测试重构为动态的演化发现过程。其核心是由大语言模型（Large Language Model, LLM）驱动的策略编排器，该编排器能自主决策是演化有潜力的攻击路径，还是探索多样化的策略分支，从而动态构建并扩展策略树。随后，多模态执行器负责实施这些复杂策略。在对12个主流VLM的实验中，TreeTeaming在11个模型上实现了最优的攻击成功率，超越现有方法，并在GPT-4o上达到87.60%。该框架还展现出比以往公开越狱策略合集更卓越的策略多样性。此外，生成的攻击手段平均毒性降低了23.09%，体现了其隐蔽性与微妙性。本研究为自动化漏洞发现引入了新范式，强调有必要超越静态启发式方法，通过主动探索来保障前沿人工智能模型的安全。

摘要 (Abstract)

The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models.

Keywords: Vision-Language Models, Red Teaming, Autonomous Agents, Strategy Exploration, Hierarchical Tree, Safety Vulnerabilities, LLM Orchestrator, Multimodal Actuator

225. ❌ Template-Based Feature Aggregation Network for Industrial Anomaly Detection

Authors: Wei Luo, Haiming Yao, Wenyong Yu Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22874v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

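Scoring rationale: The paper focuses on industrial anomaly detection and proposes a template-based feature aggregation network (TFA-Net). Although it extracts features with a pre-trained convolutional neural network (CNN), its core content is unrelated to all scoring keywords, which center on large language models, innovations in deep learning principles, AI for Science, and similar topics. The paper covers no large-model technology, language models, alignment, reasoning, agents, compression, or scientific AI applications, so every keyword receives a relevance of 0.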

!!! tip deepseek-chat TL;DR

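The paper proposes a template-based feature aggregation network (TFA-Net) to address shortcut learning in feature-reconstruction methods for industrial anomaly detection. By aggregating input features onto template features, it filters out anomalous features, achieving state-of-the-art detection performance on several real-world industrial datasets while meeting real-time requirements.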

摘要翻译

工业异常检测在确保产品质量控制方面发挥着至关重要的作用。因此，提出一种有效的异常检测模型具有重要意义。尽管现有的特征重建方法已展现出优异的性能，但它们面临着捷径学习（shortcut learning）的挑战，这可能导致异常特征被不良地重建。为解决这一问题，我们提出了一种新颖的特征重建模型，称为基于模板的特征聚合网络（Template-based Feature Aggregation Network，简称TFA-Net），通过基于模板的特征聚合进行异常检测。具体而言，TFA-Net首先从一个预训练的卷积神经网络中为固定模板图像和输入图像提取多层级特征。TFA-Net并非直接重建输入特征，而是将其聚合到模板特征上，从而有效过滤掉与正常模板特征相似度低的异常特征。接着，TFA-Net利用已在输入特征中融合了正常特征的模板特征来细化特征细节，并获得重建后的特征图。最后，通过比较输入特征与重建特征之间的差异，可以定位缺陷区域。此外，模型采用了针对输入特征的随机掩码策略，以增强整体检测性能。我们提出的基于模板的特征聚合方案构建了一个非平凡且富有意义的特征重建任务。这种简洁而高效的TFA-Net在多个真实工业数据集上展现了最先进的检测性能。同时，它满足了工业场景的实时性需求，使其非常适合于实际工业应用。代码可在https://github.com/luow23/TFA-Net获取。

摘要 (Abstract)

Industrial anomaly detection plays a crucial role in ensuring product quality control. Therefore, proposing an effective anomaly detection model is of great significance. While existing feature-reconstruction methods have demonstrated excellent performance, they face challenges with shortcut learning, which can lead to undesirable reconstruction of anomalous features. To address this concern, we present a novel feature-reconstruction model called the Template-based Feature Aggregation Network (TFA-Net) for anomaly detection via template-based feature aggregation. Specifically, TFA-Net first extracts multiple hierarchical features from a pre-trained convolutional neural network for a fixed template image and an input image. Instead of directly reconstructing input features, TFA-Net aggregates them onto the template features, effectively filtering out anomalous features that exhibit low similarity to normal template features. Next, TFA-Net utilizes the template features that have already fused normal features in the input features to refine feature details and obtain the reconstructed feature map. Finally, the defective regions can be located by comparing the differences between the input and reconstructed features. Additionally, a random masking strategy for input features is employed to enhance the overall inspection performance of the model. Our template-based feature aggregation schema yields a nontrivial and meaningful feature reconstruction task. The simple, yet efficient, TFA-Net exhibits state-of-the-art detection performance on various real-world industrial datasets. Additionally, it fulfills the real-time demands of industrial scenarios, rendering it highly suitable for practical applications in the industry. Code is available at https://github.com/luow23/TFA-Net.

Keywords: industrial anomaly detection, feature reconstruction, template-based feature aggregation, shortcut learning, pre-trained CNN, real-time inspection, state-of-the-art performance, TFA-Net

Authors: Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

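Scoring rationale: The paper studies AI forensic search for video surveillance; its core contribution is video question answering and event localization for multimodal (image + text) queries. Keyword relevance: (1) High (10): "Retrieval-Augmented Generation (RAG)", because the paper explicitly proposes a VideoRAG system and improves on prior VideoRAG methods; (2) Moderate (8): "Large Language Models (LLMs)", because the paper uses a Video Large Language Model (VideoLLM) for question answering; (3) Slight (5): "AI for Science", since video surveillance can be viewed as an application of AI in security, a scientific application only in a broad sense; (4) Other keywords (0): the paper does not cover MoE, quantization, alignment, inference optimization, or other large-model technical details.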

!!! tip deepseek-chat TL;DR

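The paper addresses retrieval with multimodal queries in video surveillance, proposing the ForeSeaQA benchmark and the ForeSea system. On ForeSeaQA, its three-stage pipeline improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models.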

摘要翻译

尽管经过数十年的研究，在跨多摄像头的长时视频中定位特定目标仍是监控领域的难题。现有方法——包括追踪流程、基于CLIP的模型以及视频检索增强生成（VideoRAG）——需要大量人工筛选、仅能捕捉浅层属性，且缺乏时序推理能力。现实世界的搜索本质上是多模态的（例如，结合人物图像提问“此人何时参与斗殴？”），但这一场景仍未得到充分探索。此外，目前缺乏合适的基准来评估这种基于多模态查询的视频问答系统。为填补这一空白，我们提出了ForeSeaQA——一个专为图像-文本联合查询的视频问答任务设计的新基准，其标注包含关键事件的时间戳。该数据集由长时监控视频片段与多样化的多模态问题配对构成，能够在真实取证场景下系统评估检索、时序定位及多模态推理能力。不限于此基准，我们还提出了ForeSea，一个包含三阶段即插即用流程的AI取证搜索系统：（1）追踪模块过滤无关视频片段；（2）多模态嵌入模块对剩余片段建立索引；（3）在推理阶段，系统检索出前K个候选片段，交由视频大语言模型（VideoLLM）进行查询回答与事件定位。在ForeSeaQA上，ForeSea相较于先前的VideoRAG模型将准确率提升了3.5%，时序交并比（IoU）提高了11.0%。据我们所知，ForeSeaQA是首个支持精确时序定位的复杂多模态查询基准，而ForeSea是首个专为此场景优化构建的VideoRAG系统。

摘要 (Abstract)

Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods – tracking pipelines, CLIP based models, and VideoRAG – require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., “When does this person join the fight?” with the person’s image), yet this setting remains underexplored. Also, there are no proper benchmarks to evaluate those setting - asking video with multimodal queries. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Not limited to this benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline. (1) A tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.

Keywords: Video Surveillance, Multimodal Queries, Forensic Search, Video Large Language Model, Retrieval-Augmented Generation, Temporal Grounding, Benchmark Dataset, Video Question Answering
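Two quantities in the abstract lend themselves to a small worked example: the top-K clip retrieval in stage (3) and the temporal IoU metric. The sketch below is illustrative only. Ranking by cosine similarity, the function names `retrieve_top_k` and `temporal_iou`, and the toy two-dimensional embeddings are assumptions; the abstract does not specify how ForeSea scores clips or exactly how it computes temporal IoU.

```python
import math


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query_vec, clip_index, k):
    """Return the ids of the k clips whose embeddings best match the query."""
    ranked = sorted(clip_index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]


# Hypothetical clip embeddings and query, for illustration only.
clip_index = {"clip_a": [1.0, 0.0], "clip_b": [0.0, 1.0], "clip_c": [0.9, 0.1]}
candidates = retrieve_top_k([1.0, 0.0], clip_index, k=2)
overlap = temporal_iou((10.0, 20.0), (15.0, 25.0))
print(candidates, round(overlap, 3))
```

In a pipeline like the one described, the retrieved candidates would then be passed to the VideoLLM, and the predicted event interval would be compared against the annotated timestamps.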

227. ❌ Designing to Forget: Deep Semi-parametric Models for Unlearning

Authors: Amber Yijia Zheng, Yu-Shan Tai, Raymond A. Yeh Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22870v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

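Scoring rationale: The paper studies machine unlearning and proposes deep semi-parametric models (SPMs) for efficient unlearning, a specific area concerned with modifying machine learning models after training. All scoring keywords focus on large models (LLMs) and related techniques (MoE, quantization, inference acceleration, alignment, and so on) or on large-model applications in science. This paper involves no large-model technology and is not applied to a scientific domain, so it is unrelated to all keywords.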

!!! tip deepseek-chat TL;DR

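The paper proposes deep semi-parametric models (SPMs) whose fusion module enables explicit deletion of training samples. On image classification and generation they match the performance of parametric models while making machine unlearning significantly more efficient and effective.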

摘要翻译

近期机器学习遗忘领域的研究进展主要集中于开发从已训练模型中移除特定训练样本的算法。与此相对，我们观察到并非所有模型都同样易于实施遗忘。为此，我们提出了一类深度半参数模型，该类模型在遗忘过程中表现出非参数特性。SPMs采用融合模块聚合每个训练样本的信息，从而能够在测试阶段显式删除选定样本，而无需修改模型参数。实验表明，在图像分类与生成任务中，SPMs的任务性能与参数模型相当，同时在遗忘效率上显著更优。值得注意的是，在ImageNet分类任务中，SPMs将相对于重新训练基准模型的预测差距降低了11%，且相比现有参数模型遗忘方法实现了超过10倍的加速。代码已发布于https://github.com/amberyzheng/spm_unlearning。

摘要 (Abstract)

Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10× faster unlearning compared to existing approaches on parametric models. The code is available at https://github.com/amberyzheng/spm_unlearning.

Keywords: machine unlearning, deep semi-parametric models, non-parametric behavior, fusion module, explicit deletion, image classification, efficient unlearning, prediction gap reduction
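The key idea in the abstract is that a prediction which aggregates stored training samples can forget a sample by deleting it, with no parameter update. The toy classifier below illustrates that idea only. It is a plain kernel-weighted vote over a memory of samples, not the paper's learned fusion module; the class name `MemoryClassifier`, the Gaussian kernel, and the example data are all assumptions.

```python
import math


class MemoryClassifier:
    """Toy non-parametric classifier: predictions aggregate stored samples."""

    def __init__(self, temperature=1.0):
        self.memory = {}              # sample_id -> (feature vector, label)
        self.temperature = temperature

    def add(self, sample_id, feature, label):
        self.memory[sample_id] = (feature, label)

    def forget(self, sample_id):
        """Unlearn one sample by deleting it; nothing is retrained."""
        self.memory.pop(sample_id, None)

    def predict(self, query):
        """Return the label with the largest kernel-weighted vote."""
        votes = {}
        for feature, label in self.memory.values():
            dist2 = sum((q - f) ** 2 for q, f in zip(query, feature))
            votes[label] = votes.get(label, 0.0) + math.exp(-dist2 / self.temperature)
        return max(votes, key=votes.get) if votes else None


# Hypothetical two-sample memory, for illustration only.
clf = MemoryClassifier()
clf.add("s1", [0.0, 0.0], "cat")
clf.add("s2", [2.0, 2.0], "dog")
before = clf.predict([0.0, 0.1])
clf.forget("s1")
after = clf.predict([0.0, 0.1])
print(before, after)
```

Because the deleted sample no longer contributes to any vote, the result after `forget` is identical to rebuilding the classifier without that sample, which is the property that makes this style of model easy to unlearn.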

228. ❌ A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection

Authors: Wei Luo, Haiming Yao, Zhenfeng Qiang, Xiaotian Zhang, Weihang Zhang Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

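Scoring rationale: The paper focuses on unsupervised anomaly detection in computer vision and proposes FSR, a general framework based on feature shuffling and restoration, to address the identical shortcut problem in reconstruction methods. It covers image processing, feature extraction, and reconstruction networks, but does not involve large language models (LLMs), innovations in deep learning principles, or applications of large models in other domains. All scoring keywords relate to large models, deep learning techniques, or AI-for-science applications, and the paper's research area (anomaly detection in computer vision) has no direct connection to them, so every keyword receives a relevance of 0.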

!!! tip deepseek-chat TL;DR

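The paper proposes FSR (Feature Shuffling and Restoration), a universal unsupervised anomaly detection framework. By using multi-scale features as reconstruction targets and restoring shuffled feature blocks, it alleviates the identical shortcut problem across scenarios and achieves strong cross-scenario detection performance.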

摘要翻译

无监督异常检测在工业领域至关重要，其中基于重构的方法因其简洁性和有效性而备受青睐。然而，重构方法常遭遇“同一性捷径”问题，即正常区域和异常区域均可能被良好重构，导致无法识别异常值。该问题的严重性随正常数据分布复杂度的增加而加剧。因此，现有方法可能在特定场景下表现出优异的检测性能，但在迁移至另一场景时性能急剧下降。本文致力于建立一个适用于不同场景下异常检测任务的通用模型，即通用异常检测。本工作中，我们提出了一种新颖、简洁而高效的通用异常检测框架：特征打乱与重构（FSR），该框架能够缓解不同场景下的同一性捷径问题。首先，FSR采用具有丰富语义信息的多尺度特征作为重构目标，而非原始图像像素。随后，这些多尺度特征被划分为非重叠的特征块，经过随机打乱后，通过一个重构网络将其恢复至原始状态。这一简单范式促使模型更关注全局上下文信息。此外，我们引入了一个新概念——打乱率，以调节FSR任务的复杂度，从而缓解不同场景下的同一性捷径问题。进一步，我们从网络结构和互信息两个视角为FSR框架的有效性提供了理论解释。大量实验结果验证了FSR框架在不同场景下的优越性和高效性。代码发布于https://github.com/luow23/FSR。

摘要 (Abstract)

Unsupervised anomaly detection is vital in industrial fields, with reconstruction-based methods favored for their simplicity and effectiveness. However, reconstruction methods often encounter an identical shortcut issue, where both normal and anomalous regions can be well reconstructed and fail to identify outliers. The severity of this problem increases with the complexity of the normal data distribution. Consequently, existing methods may exhibit excellent detection performance in a specific scenario, but their performance sharply declines when transferred to another scenario. This paper focuses on establishing a universal model applicable to anomaly detection tasks across different settings, termed as universal anomaly detection. In this work, we introduce a novel, straightforward yet efficient framework for universal anomaly detection: Feature Shuffling and Restoration (FSR), which can alleviate the identical shortcut issue across different settings. First and foremost, FSR employs multi-scale features with rich semantic information as reconstruction targets, rather than raw image pixels. Subsequently, these multi-scale features are partitioned into non-overlapping feature blocks, which are randomly shuffled and then restored to their original state using a restoration network. This simple paradigm encourages the model to focus more on global contextual information. Additionally, we introduce a novel concept, the shuffling rate, to regulate the complexity of the FSR task, thereby alleviating the identical shortcut across different settings. Furthermore, we provide theoretical explanations for the effectiveness of FSR framework from two perspectives: network structure and mutual information. Extensive experimental results validate the superiority and efficiency of the FSR framework across different settings. Code is available at https://github.com/luow23/FSR.

Keywords: unsupervised anomaly detection, feature shuffling, feature restoration, reconstruction-based methods, identical shortcut issue, universal anomaly detection, multi-scale features, restoration network
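The shuffling step described in the abstract (partition a feature map into non-overlapping blocks, then randomly shuffle them) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the feature map is a 2D grid of scalars standing in for feature vectors, and the shuffling rate is interpreted as the fraction of blocks that take part in the permutation. The paper's exact definition of the shuffling rate and its restoration network are not reproduced here, and the name `shuffle_blocks` is hypothetical.

```python
import random


def shuffle_blocks(feature_map, block, rate, seed=0):
    """Shuffle a fraction `rate` of the non-overlapping block x block tiles.

    Returns the shuffled map and `source`, where source[d] is the index of
    the original block now placed at block position d.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("feature map size must be divisible by block size")
    cols = width // block
    num_blocks = (height // block) * cols

    rng = random.Random(seed)
    chosen = rng.sample(range(num_blocks), k=round(rate * num_blocks))
    permuted = chosen[:]
    rng.shuffle(permuted)

    source = list(range(num_blocks))          # identity for untouched blocks
    for dst, src in zip(chosen, permuted):
        source[dst] = src

    out = [[None] * width for _ in range(height)]
    for dst, src in enumerate(source):
        dst_row, dst_col = divmod(dst, cols)
        src_row, src_col = divmod(src, cols)
        for i in range(block):
            for j in range(block):
                out[dst_row * block + i][dst_col * block + j] = (
                    feature_map[src_row * block + i][src_col * block + j])
    return out, source


# Hypothetical 4x4 map split into four 2x2 blocks, all of them shuffled.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled, source = shuffle_blocks(fmap, block=2, rate=1.0, seed=0)
for row in shuffled:
    print(row)
```

A restoration network would take `shuffled` as input and be trained to output `fmap`; a higher rate moves more blocks and therefore makes the restoration task harder.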

Authors: Chengxin Lv, Yihui Li, Hongyu Yang, YunHong Wang Venue/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22852v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

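Scoring rationale: The paper focuses on 3D semantic occupancy prediction for autonomous driving and proposes a multi-modal framework based on 3D Gaussians. Although it involves computer vision, deep learning, and multi-modal fusion, every keyword targets large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, quantization, and inference acceleration) or AI for Science (bioinformatics, cheminformatics). The paper does not touch on language models, the technical principles of large models, or AI applications in scientific domains, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Gau-Occ, a multi-modal framework that models the scene as a compact set of semantic 3D Gaussians to address the computational cost of 3D semantic occupancy prediction in autonomous driving, achieving state-of-the-art performance with significantly better computational efficiency.

Abstract (Chinese translation)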

三维语义占据预测对自动驾驶至关重要。尽管多模态融合相比纯视觉方法提升了准确性，但其通常依赖于计算密集的密集体素或鸟瞰图张量。我们提出Gau-Occ，一种多模态框架，通过将场景建模为紧凑的语义三维高斯集合，绕过了密集体素处理。为确保几何完整性，我们提出激光雷达补全扩散器，该模块从稀疏激光雷达点云中恢复缺失结构，以初始化鲁棒的高斯锚点。此外，我们引入高斯锚点融合，该方法通过几何对齐的二维采样与跨模态对齐，高效整合多视角图像语义。通过优化这些紧凑的高斯描述符，Gau-Occ同时捕捉了空间一致性与语义区分性。在多个挑战性基准测试上的大量实验表明，Gau-Occ以显著的计算效率实现了最先进的性能。

Abstract

3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significant computational efficiency.

Keywords: 3D semantic occupancy prediction, autonomous driving, multi-modal fusion, 3D Gaussians, LiDAR completion, computational efficiency, state-of-the-art performance

230. ❌ Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Authors: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

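Scoring rationale: The paper focuses on optimizing multimodal chain-of-thought reasoning. It is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points), since this is its core research subject. It is fairly relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points), because it analyzes and optimizes in-depth reasoning processes, and to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because the multimodal reasoning it studies is typically built on large vision-language models. Other keywords, such as MoE, quantization, and RAG, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the coarse optimization granularity of existing multimodal chain-of-thought reasoning methods, the paper proposes PEPO, a token-level optimization method based on a perception prior and an exploration strategy, which outperforms existing reinforcement learning baselines on multiple benchmarks.

Abstract (Chinese translation)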

多模态思维链推理要求大型视觉语言模型构建感知基础与多步推理交替进行的推理轨迹。然而，现有的可验证奖励强化学习方法通常在粗粒度上优化推理，将思维链视为统一整体而未区分其不同程度的视觉基础。本研究对多模态推理轨迹进行了词元级分析，结果表明成功的推理具有结构化词元动态特征，这种动态同时反映了感知基础与探索性推理。基于此分析，我们提出感知-探索策略优化方法，该方法通过隐藏状态相似性推导感知先验，并通过平滑门控机制将其与词元熵整合，从而生成词元级优势值。该方法可与现有可验证奖励强化学习框架（如GRPO和DAPO）无缝集成，既不需要额外监督信号，也无需辅助分支。在涵盖几何推理、视觉定位、视觉谜题求解和少样本分类的多样化多模态基准测试中，大量实验表明该方法相较于强强化学习基线模型取得了持续且稳健的性能提升，同时保持了稳定的训练动态。代码：https://github.com/xzxxntxdy/PEPO

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO

Keywords: Multimodal Chain-of-Thought, Token-level Analysis, Perception-Exploration Policy Optimization, Reinforcement Learning with Verifiable Rewards, Visual Grounding, Multi-step Inference, PEPO, Reasoning Trajectories

231. ❌ L-UNet: An LSTM Network for Remote Sensing Image Change Detection

Authors: Shuting Sun, Lin Mu, Lizhe Wang, Peng Liu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22842v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

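Scoring rationale: The paper focuses on remote sensing image change detection and proposes deep learning networks based on LSTM and UNet (L-UNet and AL-UNet), placing it in computer vision and remote sensing applications. Every keyword targets large-model (LLM) technology, training methods, inference optimization, alignment, agent systems, and the like, whereas this paper involves no large models or related techniques and only uses conventional deep learning architectures (LSTM, CNN) for a domain-specific task. The only marginally relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': remote sensing is an earth science application and can be seen as AI applied to science, but the paper does not mention bioinformatics or cheminformatics, and its contribution lies in network structure rather than large-model technology, so it receives 5 points (somewhat related). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

For change detection in high-resolution remote sensing images, the paper proposes L-UNet, an end-to-end spatiotemporal network combining LSTM and UNet, and its improved variant AL-UNet; experiments show the method outperforms the compared methods in both quantitative and qualitative evaluation.

Abstract (Chinese translation)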

高分辨率遥感影像变化检测是地球观测领域的一项重要任务，并已得到广泛研究。近年来，深度学习在众多遥感任务中展现出显著成效。当前基于深度学习的变化检测方法主要依赖于传统长短期记忆网络（Conv-LSTM），但其缺乏空间特征。由于变化检测是一个兼具空间性与时序性的过程，有必要提出一种端到端的时空网络。为此，本文引入了Conv-LSTM结构的扩展形式。鉴于其与卷积层具有相似的空间特性，我们提出了L-UNet——将UNet部分卷积层替换为Conv-LSTM，并进一步提出空洞L-UNet（AL-UNet），该模型利用空洞结构以提取多尺度空间信息。在两个数据集上进行的实验表明，与其他方法相比，所提方法在定量评估与视觉质量上均展现出优势。

Abstract

Change detection of high-resolution remote sensing images is an important task in earth observation and has been extensively investigated. Recently, deep learning has been shown to be very successful in many remote sensing tasks. Current deep learning-based change detection methods are mainly based on conventional long short-term memory (LSTM), which does not have spatial characteristics. Since change detection is both a spatial and a temporal process, an end-to-end spatiotemporal network is needed. To achieve this, Conv-LSTM, an extension of the LSTM structure, is introduced. Since it shares similar spatial characteristics with the convolutional layer, we propose L-UNet, which replaces some of the convolution layers of UNet with Conv-LSTM, and Atrous L-UNet (AL-UNet), which further uses an atrous structure to extract multiscale spatial information. Experiments on two datasets show that the proposed methods have advantages, both quantitatively and qualitatively, over the compared methods.

Keywords: change detection, remote sensing images, LSTM, UNet, spatiotemporal network, deep learning, Conv-LSTM, Atrous structure

232. ❌ MultiCam: On-the-fly Multi-Camera Pose Estimation Using Spatiotemporal Overlaps of Known Objects

Authors: Shiyu Li, Hannah Schieber, Kristoffer Waldow, Benjamin Busam, Julian Kreimeier, Daniel Roth Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22839v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

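Scoring rationale: The paper sits in computer vision and augmented reality, studying dynamic multi-camera pose estimation with marker-less tracking based on spatiotemporal field-of-view overlaps of known objects. It does not involve large language models, the technical principles of deep learning, AI for Science, or any technical concept among the scoring keywords. Every keyword targets large models, deep learning, or scientific applications of AI, whereas this paper is purely a computer vision/AR systems study, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a marker-less method for dynamic multi-camera pose estimation that uses spatiotemporal field-of-view overlaps of known objects, outperforms existing methods on the YCB-V and T-LESS datasets, and releases a new multi-camera, multi-object pose estimation dataset.

Abstract (Chinese translation)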

多相机动态增强现实（AR）应用需要通过相机姿态估计，在一个统一系统中利用每个相机的独立信息。这可以通过整合多视角下的上下文信息（如标记或物体）来实现。通常，相机会在初始步骤进行标定，或通过持续使用标记进行更新，另一种方案则是利用场景中已有的已知物体信息。基于标记追踪的另一缺点是，标记必须始终处于相机视野（FoV）范围内。
为克服这些限制，我们提出了一种基于已知物体时空视野重叠的持续动态相机姿态估计方法。为此，我们改进了当前最优的物体姿态估计器，以更新我们的时空场景图，从而建立即使非重叠视野相机之间的关系。为评估本方法，我们引入了一个包含静态与动态相机、具有时序视野重叠的多相机多物体姿态估计数据集。此外，在视野重叠场景中，我们在广泛使用的YCB-V和T-LESS数据集上的相机姿态精度超越了现有最优方法。我们在既有数据集及新提出数据集上的表现，验证了本无标记方法在AR应用中的有效性。
代码与数据集发布于https://github.com/roth-hex-lab/IEEE-VR-2026-MultiCam。

Abstract

Multi-camera dynamic Augmented Reality (AR) applications require a camera pose estimation to leverage individual information from each camera in one common system. This can be achieved by combining contextual information, such as markers or objects, across multiple views. While commonly cameras are calibrated in an initial step or updated through the constant use of markers, another option is to leverage information already present in the scene, like known objects. Another downside of marker-based tracking is that markers have to be tracked inside the field-of-view (FoV) of the cameras. To overcome these limitations, we propose a constant dynamic camera pose estimation leveraging spatiotemporal FoV overlaps of known objects on the fly. To achieve that, we enhance the state-of-the-art object pose estimator to update our spatiotemporal scene graph, enabling a relation even among non-overlapping FoV cameras. To evaluate our approach, we introduce a multi-camera, multi-object pose estimation dataset with temporal FoV overlap, including static and dynamic cameras. Furthermore, in FoV overlapping scenarios, we outperform the state-of-the-art on the widely used YCB-V and T-LESS dataset in camera pose accuracy. Our performance on both previous and our proposed datasets validates the effectiveness of our marker-less approach for AR applications. The code and dataset are available on https://github.com/roth-hex-lab/IEEE-VR-2026-MultiCam.

Keywords: multi-camera pose estimation, spatiotemporal FoV overlaps, marker-less tracking, object pose estimator, scene graph, augmented reality, dynamic cameras, dataset
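
The geometric core of relating two cameras through a shared known object can be shown with standard rigid-body algebra. The sketch below is not the MultiCam pipeline (the abstract does not describe its internals); it only illustrates the identity that, if both cameras estimate the pose of the same object, the pose of camera 2 in camera 1's frame is `T_c1_obj · inv(T_c2_obj)`. The function names and the convention that `t_cX_obj` maps object-frame points into camera X's frame are assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a 4x4 rigid transform [R | p; 0 0 0 1] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_camera_pose(t_c1_obj, t_c2_obj):
    """Transform mapping camera-2 coordinates into camera-1 coordinates."""
    return matmul(t_c1_obj, invert_rigid(t_c2_obj))


if __name__ == "__main__":
    # Hypothetical object poses: same orientation, different offsets (metres).
    t_c1_obj = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0],
                [0.0, 0.0, 0.0, 1.0]]
    t_c2_obj = [[1.0, 0.0, 0.0, 0.5],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0],
                [0.0, 0.0, 0.0, 1.0]]
    for row in relative_camera_pose(t_c1_obj, t_c2_obj):
        print(row)
```

Because the two observations do not need to happen at the same instant when the object is static, the same composition can link cameras whose fields of view never overlap simultaneously, which is the intuition behind using temporal overlaps.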

233. ❌ MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion

Authors: Zuxian He, Xu Cheng, Zhaodong Sun, Haoyu Chen, Jingang Shi, Xiaobai Li, Guoying Zhao Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22826v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

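Scoring rationale: The paper sits in computer vision and biomedical signal processing, studying remote photoplethysmography (rPPG) and using multi-view learning to handle facial motion and occlusion. It applies deep learning to biomedical signal analysis but is unrelated to every large-model (LLM) keyword (such as LLMs, MoE, Scaling Laws, RLHF, RAG, and CoT). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': rPPG belongs to biomedical signal processing, an application of AI to science (particularly bioinformatics-adjacent), but the paper does not innovate on large models or the technical principles of deep learning and only uses conventional deep learning frameworks, so it receives 5 points (somewhat related). All other keywords receive 0 points (unrelated).

!!! tip deepseek-chat TL;DR

To address remote photoplethysmography (rPPG) measurement under facial motion and occlusion, the paper introduces MVRD, a multi-view dataset, and MVRD-rPPG, a learning framework that combines adaptive motion compensation, a dual-stream network, and multi-view attention, achieving accurate heart rate estimation in experiments (MAE 0.90, Pearson R 0.99).

Abstract (Chinese translation)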

远程光电容积描记术（rPPG）是一种非接触式技术，通过分析面部视频中细微的肤色变化来估计生理信号。现有的rPPG方法通常依赖于静态单视角面部视频，因此在面部运动和遮挡场景下常出现性能下降。为此，本研究致力于解决无约束多视角面部视频中rPPG测量所面临的运动诱发遮挡问题。具体而言，我们引入了多视角rPPG数据集（MVRD），这是一个高质量基准数据集，包含静态、说话和头部运动三种场景下从三个视角同步采集的面部视频，以更好地匹配真实环境。我们还提出了MVRD-rPPG，一个统一的多视角rPPG学习框架，通过融合互补的视觉线索来维持稳健的面部皮肤覆盖，尤其在运动条件下。我们的方法集成了自适应时序光学补偿（ATOC）模块以抑制运动伪影，采用节律-视觉双流网络来解耦节律特征与外观相关特征，并利用多视角关联感知注意力（MVCA）机制进行自适应的视角间信号聚合。此外，我们引入了关联频率对抗（CFA）学习策略，该策略在预测信号中同时强化时序准确性、频谱一致性和感知真实性。在MVRD数据集上进行的大量实验与消融研究证明了我们方法的优越性。在MVRD运动场景中，MVRD-rPPG取得了0.90的平均绝对误差（MAE）和0.99的皮尔逊相关系数（R）。源代码与数据集将公开提供。

Abstract

Remote photoplethysmography (rPPG) is a non-contact technique that estimates physiological signals by analyzing subtle skin color changes in facial videos. Existing rPPG methods often encounter performance degradation under facial motion and occlusion scenarios due to their reliance on static and single-view facial videos. Thus, this work focuses on tackling the motion-induced occlusion problem for rPPG measurement in unconstrained multi-view facial videos. Specifically, we introduce a Multi-View rPPG Dataset (MVRD), a high-quality benchmark dataset featuring synchronized facial videos from three viewpoints under stationary, speaking, and head movement scenarios to better match real-world conditions. We also propose MVRD-rPPG, a unified multi-view rPPG learning framework that fuses complementary visual cues to maintain robust facial skin coverage, especially under motion conditions. Our method integrates an Adaptive Temporal Optical Compensation (ATOC) module for motion artifact suppression, a Rhythm-Visual Dual-Stream Network to disentangle rhythmic and appearance-related features, and a Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation. Furthermore, we introduce a Correlation Frequency Adversarial (CFA) learning strategy, which jointly enforces temporal accuracy, spectral consistency, and perceptual realism in the predicted signals. Extensive experiments and ablation studies on the MVRD dataset demonstrate the superiority of our approach. In the MVRD movement scenario, MVRD-rPPG achieves an MAE of 0.90 and a Pearson correlation coefficient (R) of 0.99. The source code and dataset will be made available.

Keywords: remote photoplethysmography, multi-view learning, facial occlusion, motion artifact suppression, physiological signal estimation, benchmark dataset, deep learning, biomedical signal processing
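
The two metrics quoted for MVRD-rPPG, mean absolute error (MAE) and the Pearson correlation coefficient (R), are standard and can be computed as below. This is a generic sketch of the metric definitions, not the paper's evaluation code; the heart-rate values in the usage example are made up for illustration and the function names are hypothetical.

```python
import math


def mean_absolute_error(pred, truth):
    """Average absolute difference between predictions and references."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Made-up per-clip heart-rate estimates and reference values.
    predicted = [71.0, 65.5, 88.0, 79.0]
    reference = [70.0, 66.0, 90.0, 78.5]
    print(mean_absolute_error(predicted, reference))
    print(pearson_r(predicted, reference))
```

A low MAE together with an R close to 1 indicates that estimates are both close in value and track the reference across samples; either metric alone can hide a failure of the other.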

Authors: Zhiceng Shi, Changmiao Wang, Jun Wan, Wenwen Min Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22821v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

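Scoring rationale: The paper sits in bioinformatics and proposes SpaHGC, a method for inferring spatial gene expression based on multi-modal heterogeneous graph contrastive learning. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, agents, and so on), since those target large language model technology in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics'; the paper is a bioinformatics application, so it receives 10 points (highly relevant).

!!! tip deepseek-chat TL;DR

The paper proposes SpaHGC, a multi-modal heterogeneous graph contrastive learning model for predicting spatial transcriptomics data from pathology images; it addresses the difficulty existing methods have in capturing complex cross-slide spatial relationships and significantly outperforms existing methods on multiple datasets.

Abstract (Chinese translation)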

空间转录组学（ST）虽已深化了我们对组织环境中基因表达的理解，但其高昂的实验成本限制了其大规模应用。基于病理图像预测空间转录组数据是一种具有前景且经济高效的替代方案，但现有方法难以捕捉复杂的跨切片空间关系。为应对这一挑战，我们提出了SpaHGC，一种基于多模态异质图（multi-modal heterogeneous graph）的模型，能够从组织学图像中捕获切片内和切片间的点-点关系。该模型整合了目标切片内的局部空间上下文，以及通过病理学基础模型提取的图像嵌入（image embeddings）计算得出的跨切片相似性。这些嵌入实现了切片间的知识迁移，SpaHGC进一步结合了掩码图对比学习（Masked Graph Contrastive Learning），以增强特征表示，并将空间基因表达知识从参考切片迁移至目标切片，从而使其能够建模复杂的空间依赖性并显著提升预测准确性。我们在来自不同平台、组织和癌症亚型的七个匹配的组织学-ST数据集上进行了全面的基准测试。结果表明，在所有评估指标上，SpaHGC均显著优于现有的九种最先进方法。此外，其预测结果在多个癌症相关通路中显著富集，从而凸显了其强大的生物学相关性和应用潜力。

Abstract

While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms the existing nine state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential.

Keywords: spatial transcriptomics, histology images, heterogeneous graph, contrastive learning, knowledge transfer, gene expression prediction, bioinformatics, AI for science
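
The abstract says cross-slide similarities are computed from image embeddings produced by a pathology foundation model. One plausible way to turn such similarities into inter-slice graph edges is a cosine-similarity top-k search, sketched below. The choice of cosine similarity, the top-k rule, and the name `cross_slide_edges` are assumptions for illustration; the abstract does not state how SpaHGC builds its edges, and the toy 2-D vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_slide_edges(target_emb, reference_emb, k):
    """Link each target spot to its k most similar reference spots.

    Returns a list of (target_index, reference_index, similarity) tuples.
    """
    edges = []
    for i, t in enumerate(target_emb):
        scored = sorted(((cosine(t, r), j) for j, r in enumerate(reference_emb)),
                        reverse=True)
        edges.extend((i, j, s) for s, j in scored[:k])
    return edges


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.0, 1.0]]                  # toy target-slide spots
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy reference-slide spots
    for edge in cross_slide_edges(target, reference, k=2):
        print(edge)
```

Intra-slice edges would come from spatial proximity within the target slide instead, which is why the resulting graph is heterogeneous: the two edge types carry different kinds of relationship.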

235. ❌ Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive Piezoelectric Structural Measurements Using Deep Learning

Authors: Chandler B. Smith, S. Hales Swift, Andrew Steyer, Ihab El-Kady Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23496v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

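Scoring rationale: The paper uses a convolutional neural network (CNN) to estimate aerodynamic state variables (flow velocity and angle of attack) from structural vibration measurements, an application of deep learning to science and engineering. It does not involve large language models (LLMs), large-model principles, training methods, inference optimization, agent systems, or model compression. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper demonstrates deep learning in aerodynamics experiments (a scientific domain), but it is not a core match (such as bioinformatics or cheminformatics), so it receives 5 points (somewhat related). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes a non-intrusive, CNN-based method that uses structural vibration data from piezoelectric sensors mounted on the vehicle's aeroshell to estimate flow velocity and angle of attack, achieving a velocity error below 2.27 m/s and an angle-of-attack error of 0.44° in controlled wind tunnel experiments.

Abstract (Chinese translation)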

准确估算自由流速度与攻角（Angle of Attack, AoA）等气动状态变量，对于气动载荷预测、飞行控制及模型验证至关重要。本研究提出一种非侵入式方法，通过结构振动测量而非皮托管等直接流动测量设备来估算飞行器速度与攻角。在飞行器外壳内壁安装密集排列的压电传感器阵列，用于捕获湍流边界层压力波动引发的结构振动；通过训练卷积神经网络（Convolutional Neural Network, CNN），将这些结构响应反演为速度与攻角信息。
该方法的概念验证在桑迪亚高超声速风洞中完成，实验覆盖零攻角与非零攻角构型、马赫数约5与约8的工况，以及恒定状态与连续变化的风洞运行模式。利用16次风洞实验数据对CNN进行训练与评估，每次实验数据中保留时间居中的一段作为独立测试集，以此构建训练集、验证集与测试集，并评估模型在单次实验内的时间泛化能力。原始CNN预测在连续变化工况下方差较大；通过短窗口移动中位数后处理步骤可抑制方差并提升鲁棒性。经后处理后，该方法在同一实验项目的独立测试数据上，相对于低通滤波参考速度的平均速度误差低于2.27 m/s（0.21%），平均攻角误差为0.44°（8.25%），证明了在受控实验室环境下基于振动实现速度与攻角估算的可行性。

Abstract

Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell captures vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA. Proof-of-concept is demonstrated through controlled experiments in Sandia’s hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach ~5 and Mach ~8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27 m/s (0.21%) and a mean AoA error of $0.44^{\circ}$ (8.25%) on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment.

Keywords: aerodynamic state estimation, convolutional neural network, piezoelectric sensors, structural vibration measurements, hypersonic wind tunnel, velocity estimation, angle of attack, non-intrusive method
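
The moving-median post-processing step mentioned in the abstract can be illustrated with a short sketch. The window length, the centered alignment, and the shrinking window at the sequence edges are assumptions made here; the abstract only says the window is short. The sample values are invented to show how an isolated spike is suppressed.

```python
from statistics import median


def moving_median(values, window):
    """Centered moving median; the window shrinks near the sequence edges.

    Assumption: `window` is an odd number of samples.
    """
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


if __name__ == "__main__":
    # Invented raw predictions with one outlier at index 2.
    raw_predictions = [1005.0, 1006.0, 1090.0, 1007.0, 1008.0, 1008.5]
    print(moving_median(raw_predictions, window=3))
```

A median is preferred over a mean for this kind of smoothing because a single large outlier shifts the mean of the window but leaves the median unchanged as long as outliers are a minority of the window.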

236. ❌ Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Authors: Rustem Islamov, Grigory Malinovsky, Alexander Gaponov, Aurelien Lucchi, Peter Richtárik, Eduard Gorbunov Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23472v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

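Scoring rationale: The paper focuses on algorithm design for privacy (differential privacy) and robustness (Byzantine fault tolerance) in federated learning, within secure optimization for distributed machine learning. Every scoring keyword targets large models, deep learning principles, or AI applications in science, whereas this paper studies a general federated learning framework that is not tied to a particular model type (such as LLMs), training technique (such as RLHF or PEFT), or application domain (such as bioinformatics), so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper proposes Byz-Clip21-SGD2M, a new algorithm that addresses differential privacy and Byzantine robustness together within a federated learning framework, and provides convergence guarantees under weaker assumptions along with empirical validation.

Abstract (Chinese translation)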

联邦学习（Federated Learning, FL）使得异构客户端能够在无需集中原始数据的情况下协作训练共享模型，从而提供了一定程度的隐私保护。然而，梯度和模型更新仍可能泄露敏感信息，同时恶意服务器可能发起诸如拜占庭篡改等对抗性攻击。这些漏洞凸显了在统一框架内解决差分隐私（Differential Privacy, DP）与拜占庭鲁棒性问题的必要性。然而，现有方法通常依赖于不切实际的假设（例如梯度有界）、需要辅助的服务器端数据集，或无法提供收敛性保证。为克服这些局限，我们提出了Byz-Clip21-SGD2M算法，该算法通过结合鲁棒聚合、双动量机制以及精心设计的梯度裁剪技术，实现了对上述问题的综合处理。我们在标准的$L$-平滑性和$σ$-次高斯梯度噪声假设下，证明了算法具有高概率收敛保证，从而放宽了先前工作中占主导地位的条件限制。我们的分析在无攻击者情况下恢复了最优收敛速率，并在拜占庭与差分隐私设置下提升了效用保证。基于MNIST数据集对CNN和MLP模型进行的实证评估进一步验证了该方法的有效性。

Abstract

Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard $L$-smoothness and $\sigma$-sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.
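
The abstract names three ingredients: clipping, momentum, and robust aggregation. The sketch below illustrates generic versions of each in plain Python. It is not the Byz-Clip21-SGD2M algorithm: the coordinate-wise median aggregator, the single momentum step, the threshold `tau`, and the toy gradients are all assumptions chosen for illustration, and no differential-privacy noise is added.

```python
import math
from statistics import median


def clip(vec, tau):
    """Rescale vec so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= tau or norm == 0.0:
        return list(vec)
    return [x * tau / norm for x in vec]


def momentum(prev, grad, beta):
    """One exponential-moving-average (momentum) step."""
    return [beta * p + (1.0 - beta) * g for p, g in zip(prev, grad)]


def coordinate_median(updates):
    """Robust aggregation (assumed here): per-coordinate median across clients."""
    return [median(column) for column in zip(*updates)]


# Hypothetical round: three honest clients and one Byzantine client
# that sends an arbitrarily large update.
client_grads = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1e6, -1e6]]

# Clipping bounds the influence any single client can have.
clipped = [clip(g, tau=5.0) for g in client_grads]

# The median ignores the remaining outlier in each coordinate.
aggregate = coordinate_median(clipped)

# A single momentum step on the aggregate; the paper's double-momentum
# scheme is not reproduced here.
server_momentum = momentum([0.0, 0.0], aggregate, beta=0.9)
```

In this toy round the aggregate stays close to the honest clients' gradients even though one client sends an update with norm above one million.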

Keywords: Federated Learning, Differential Privacy, Byzantine Robustness, Convergence Guarantees, Optimization Algorithm, Clipping, Momentum, Adversarial Attacks

237. ❌ End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Authors: Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

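Scoring rationale: The paper studies efficient reinforcement learning (RL) algorithms for linear Bellman complete MDPs, which belongs to classical RL theory and does not touch on large models, deep learning, or AI for Science. All keywords relate to large-model techniques, their applications, or AI for science, whereas this paper focuses on classical RL theory, so every keyword has a relevance of 0.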

!!! tip deepseek-chat TL;DR

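For Markov decision processes with linear Bellman completeness and deterministic transitions, this paper proposes a computationally efficient RL algorithm that learns an ε-optimal policy over finite action spaces, or over large or infinite action spaces given an argmax oracle, with sample and computational complexity polynomial in the horizon, the feature dimension, and 1/ε.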

Abstract

We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying *linear Bellman completeness*, a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with *deterministic transitions*, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
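
For reference, linear Bellman completeness is commonly formalized as follows. This is the standard finite-horizon definition from the RL literature, given here as an assumption rather than quoted from the paper. The feature map $\phi$, dimension $d$, step index $h$, and reward $r_h$ are notation introduced for illustration.

$$
\forall h,\ \forall \theta \in \mathbb{R}^d,\ \exists\, \theta' \in \mathbb{R}^d:\quad
\phi(s,a)^\top \theta' \;=\; \mathbb{E}\!\left[\, r_h(s,a) + \max_{a'} \phi(s',a')^\top \theta \;\middle|\; s, a \right]
\quad \text{for all } (s,a).
$$

With deterministic transitions, $s'$ is a fixed function of $(s,a)$, so the expectation is taken only over the stochastic reward.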

Keywords: reinforcement learning, linear function approximation, linear Bellman completeness, deterministic transitions, computationally efficient algorithm, ε-optimal policy, sample complexity, computational complexity

238. ❌ CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Authors: Abdul Rahman Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23459v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

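Scoring rationale: The paper focuses on AI-driven cybersecurity detection systems and proposes CSTS, an entity-relational abstraction that addresses fragmented representations in cross-environment deployment. Although the paper concerns an application of AI to cybersecurity, the keywords all target large language models (LLMs) and related techniques (such as training methods, inference optimization, alignment, and agents), and the paper does not involve LLM techniques, deep learning model architectures, or AI applications in science. Its core is data representation and cross-environment deployment for cybersecurity systems, not large-model technology itself.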

!!! tip deepseek-chat TL;DR

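The paper addresses the failure of AI-driven cybersecurity systems under cross-environment deployment caused by fragmented, event-centric telemetry representations. It proposes CSTS, an entity-relational abstraction that enforces identity persistence, typed relationships, and temporal state invariants, improving cross-topology transfer for identity-centric detection and preventing collapse under schema perturbation.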

Abstract

AI-driven cybersecurity systems often fail under cross-environment deployment due to fragmented, event-centric telemetry representations. We introduce the Canonical Security Telemetry Substrate (CSTS), an entity-relational abstraction that enforces identity persistence, typed relationships, and temporal state invariants. Across heterogeneous environments, CSTS improves cross-topology transfer for identity-centric detection and prevents collapse under schema perturbation. For zero-day detection, CSTS isolates semantic orientation instability as a modeling, not schema, phenomenon, clarifying layered portability requirements.

Keywords: AI-driven cybersecurity, cross-environment deployment, telemetry representation, entity-relational abstraction, identity persistence, zero-day detection, semantic orientation instability, layered portability

239. ❌ Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

Authors: Connor Mclaughlin, Nigel Lee, Lili Su Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23436v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

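Scoring rationale: The paper's core contribution is a similarity-aware mixture-of-experts (MoE) framework built on pre-trained models for continual learning under data scarcity and task overlap. It is therefore highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10 points) and fairly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (8 points), since it involves pre-trained models and continual learning. The paper does not explicitly mention large language models, but MoE is a technique commonly used in large models and the research context concerns large-model applications, so "Large Language Models OR LLMs OR Foundation Models" receives 5 points. The remaining keywords, such as SLMs, SFT, RAG, and inference acceleration, are not covered and receive 0 points.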
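
The table above pairs nonzero relevance ratings with a total of 0.0, which follows if each row's score is its weight multiplied by its relevance and every weight is 0.0. The sketch below checks that arithmetic together with the pass threshold from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27). The per-row rule `weight * relevance` is an assumption inferred from the table rows, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of per-keyword scores, assuming score = weight * relevance for each row."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) for the three rows with nonzero relevance in the table above.
rows = [(0.0, 5.0), (0.0, 10.0), (0.0, 8.0)]

threshold = pass_threshold(total_weight=27.0)
score = total_score(rows)
passed = score >= threshold
```

Under these assumptions the threshold evaluates to 26.6 and the paper's score to 0.0, matching the "0.0 / 26.6" line for this entry.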

!!! tip deepseek-chat TL;DR

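This paper addresses continual learning where data is scarce and tasks may overlap arbitrarily. It proposes a similarity-aware mixture-of-experts framework over pre-trained models that uses incremental global pooling and instance-wise prompt masking to improve sample efficiency and prevent negative transfer.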

Abstract

Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches either assume each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.

Keywords: Continual Learning, Mixture of Experts, Data Scarcity, Task Overlapping, Pre-trained Models, Sample Efficiency, Negative Knowledge Transfer, Similarity Awareness

240. ❌ Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein

Authors: Nobuyuki Ota Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23361v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

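Scoring rationale: The CDT-III paper focuses on bioinformatics, developing a mechanism-oriented AI model for DNA, RNA, and protein prediction. It is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points) because it is applied directly to bioinformatics. The paper also emphasizes model interpretability, making it highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points), since it aims to connect learned representations to molecular processes. However, the paper does not involve large language models (LLMs) or innovations in deep learning techniques (such as MoE, scaling laws, training methods, inference optimization, or agent systems), so all other keywords score 0.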

!!! tip deepseek-chat TL;DR

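This study presents CDT-III, whose two-stage Virtual Cell Embedder architecture models the central dogma across DNA, RNA, and protein. It achieves high prediction accuracy, improves interpretability with respect to biological processes, and is applied to an in silico knockdown simulation and clinical side-effect prediction.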

Abstract

Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient-based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.

Keywords: Central Dogma Transformer, Virtual Cell Embedder, DNA-RNA-protein prediction, mechanism-oriented AI, interpretable AI, bioinformatics, gradient-based profiling, in silico knockdown

241. ❌ Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection

Authors: Rodrigo F. L. Lassance, Jasper De Bock Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

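Scoring rationale: The paper studies a robustness quantification metric for classifiers and its application to dynamic classifier selection, which belongs to model evaluation and selection in traditional machine learning. It does not involve large language models, deep learning techniques, applications of large models in any domain, or any of the specified keyword techniques. All keywords relate to large models, deep learning, or AI for Science, whereas this paper focuses on evaluating traditional probabilistic discriminative classifiers, so every keyword has a relevance of 0.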

!!! tip deepseek-chat TL;DR

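This paper proposes a new robustness metric applicable to any probabilistic discriminative classifier and any type of features, and uses it to develop new strategies for dynamic classifier selection.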

Abstract

Among the different possible strategies for evaluating the reliability of individual predictions of classifiers, robustness quantification stands out as a method that evaluates how much uncertainty a classifier could cope with before changing its prediction. However, its applicability is more limited than some of its alternatives, since it requires the use of generative models and restricts the analyses either to specific model architectures or discrete features. In this work, we propose a new robustness metric applicable to any probabilistic discriminative classifier and any type of features. We demonstrate that this new metric is capable of distinguishing between reliable and unreliable predictions, and use this observation to develop new strategies for dynamic classifier selection.

Keywords: robustness quantification, discriminative models, robustness metric, probabilistic classifiers, dynamic classifier selection, reliability evaluation, uncertainty analysis, prediction reliability

242. ❌ Contextual Graph Matching with Correlated Gaussian Features

Authors: Mohammad Hassan Ahmad Yarandi, Luca Ganassali Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23305v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

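Scoring rationale: The paper studies a theoretical problem in graph matching, specifically the exact recovery thresholds for contextual graph matching with correlated Gaussian features. The keywords concern large models, deep learning techniques, or AI applications in science, none of which relates to this work, so every keyword has a relevance of 0.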

!!! tip deepseek-chat TL;DR

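This paper studies contextual graph matching between two networks whose node features and edge weights are both correlated. It derives information-theoretic thresholds for exact recovery and shows that, once contextual information is available, the thresholds for exact and almost exact recovery no longer coincide.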

Abstract

We investigate contextual graph matching in the Gaussian setting, where both edge weights and node features are correlated across two networks. We derive precise information-theoretic thresholds for exact recovery, and identify conditions under which almost exact recovery is possible or impossible, in terms of graph and feature correlation strengths, the number of nodes, and feature dimension. Interestingly, whereas an all-or-nothing phase transition is observed in the standard graph-matching scenario, the additional contextual information introduces a richer structure: thresholds for exact and almost exact recovery no longer coincide. Our results provide the first rigorous characterization of how structural and contextual information interact in graph matching, and establish a benchmark for designing efficient algorithms.

Keywords: Contextual Graph Matching, Gaussian Features, Exact Recovery, Information-Theoretic Thresholds, Phase Transition, Correlated Networks, Node Features, Edge Weights

243. ❌ SynForceNet: A Force-Driven Global-Local Latent Representation Framework for Lithium-Ion Battery Fault Diagnosis

Authors: Rongxiu Chen, Yuting Su Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23265v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

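Scoring rationale: The paper focuses on lithium-ion battery fault diagnosis, proposing a deep anomaly detection framework that combines kernel one-class classification with minimum-volume estimation and introduces mechanical constraints and STDP-based dynamic representations. It is an application of deep learning to engineering and science, but it is unrelated to all of the large-model (LLM) keywords (such as LLMs, MoE, SFT, RLHF, RAG, CoT, and Agents), because those keywords refer specifically to natural language processing or large language model techniques and the paper involves no language model or related method. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies deep learning to battery safety diagnosis (a science and engineering domain). It is not a core match (such as bioinformatics or cheminformatics), so it receives 5 points (some relevance). All other keywords score 0 (unrelated).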

!!! tip deepseek-chat TL;DR

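This study proposes an online lithium-ion battery fault diagnosis framework based on deep anomaly detection. By combining kernel one-class classification, minimum-volume estimation, mechanical constraints, and STDP-based dynamic representations, it outperforms advanced baselines on real-world electric vehicle data (for example, an average F1 score improvement of 18.28%) and suggests that different fault types may share causal structure.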

Abstract

Online safety fault diagnosis is essential for lithium-ion batteries in electric vehicles (EVs), particularly under complex and rare safety-critical conditions in real-world operation. In this work, we develop an online battery fault diagnosis network based on a deep anomaly detection framework combining kernel one-class classification and minimum-volume estimation. Mechanical constraints and spike-timing-dependent plasticity (STDP)-based dynamic representations are introduced to improve complex fault characterization and enable a more compact normal-state boundary. The proposed method is validated using 8.6 million valid data points collected from 20 EVs. Compared with several advanced baseline methods, it achieves average improvements of 7.59% in TPR, 27.92% in PPV, 18.28% in F1 score, and 23.68% in AUC. In addition, we analyze the spatial separation of fault representations before and after modeling, and further enhance framework robustness by learning the manifold structure in the latent space. The results also suggest the possible presence of shared causal structures across different fault types, highlighting the promise of integrating deep learning with physical constraints and neural dynamics for battery safety diagnosis.
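
The abstract reports gains in TPR, PPV, and F1 score. As a reminder of how these three metrics relate, the sketch below computes them from confusion-matrix counts using their standard definitions. The counts are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Return (TPR, PPV, F1) from true-positive, false-positive, and false-negative counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate (recall)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # positive predictive value (precision)
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0  # harmonic mean of PPV and TPR
    return tpr, ppv, f1


# Hypothetical counts for a fault detector: 80 faults caught,
# 20 false alarms, and 20 missed faults.
tpr, ppv, f1 = detection_metrics(tp=80, fp=20, fn=20)
```

Because F1 is the harmonic mean of PPV and TPR, a large PPV gain with a smaller TPR gain yields an F1 gain between the two, which is consistent with the ordering of the improvements quoted in the abstract.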

Keywords: lithium-ion battery, fault diagnosis, deep anomaly detection, kernel one-class classification, minimum-volume estimation, mechanical constraints, STDP dynamic representations, online safety monitoring

244. ❌ GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

Authors: Haoyu Wang, Jingcheng Wang, Shunyu Wu, Xinwei Xiao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

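Scoring rationale: The paper focuses on action selection in offline reinforcement learning (offline RL), proposing the GEM framework, which uses a Gaussian mixture model (GMM) for multimodal action selection and behavior-normalized support evaluation. The scoring keywords all relate directly to large language models, deep learning techniques, or AI applications in science, whereas this paper studies a specific algorithmic problem in reinforcement learning and does not involve large models, deep learning innovations, or AI for science, so every keyword has a relevance of 0.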

!!! tip deepseek-chat TL;DR

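To address the multimodal action selection problem induced by datasets in offline RL, the paper proposes the GEM framework, which uses a Gaussian mixture model for multimodal action selection and behavior-normalized support evaluation. It is competitive on the D4RL benchmarks and exposes an adjustable inference-time compute budget.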

Abstract

Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield “in-between” actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state’s candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.
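
The inference step described above (rerank a candidate set by a conservative ensemble lower-confidence bound plus behavior support standardized within that set) can be sketched as follows. This is a minimal illustration under stated assumptions, not GEM's actual scoring rule: the additive combination, the coefficients `kappa` and `lam`, the function names, and the toy numbers are all hypothetical.

```python
from statistics import mean, pstdev


def lower_confidence_bound(q_values, kappa=1.0):
    """Conservative value estimate: ensemble mean minus kappa times ensemble std."""
    return mean(q_values) - kappa * pstdev(q_values)


def standardize(values, eps=1e-8):
    """Z-score values within a single candidate set."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def select_action(candidates, ensemble_q, behavior_logp, kappa=1.0, lam=0.5):
    """Pick the candidate with the best LCB plus weighted, standardized support."""
    lcbs = [lower_confidence_bound(qs, kappa) for qs in ensemble_q]
    support = standardize(behavior_logp)
    scores = [lcb + lam * s for lcb, s in zip(lcbs, support)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Three hypothetical candidates for one state: per-candidate critic-ensemble
# estimates and behavior-model log-likelihoods.
candidates = ["a0", "a1", "a2"]
ensemble_q = [[1.0, 1.0], [1.2, 0.2], [0.9, 1.1]]
behavior_logp = [-1.0, -1.0, -4.0]

chosen, scores = select_action(candidates, ensemble_q, behavior_logp)
```

Here `a1` is penalized for critic disagreement and `a2` for weak data support, so `a0` is selected. Standardizing the log-likelihood inside the candidate set keeps the support term on a comparable scale regardless of how many candidates are drawn.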

Keywords: Offline Reinforcement Learning, Action Selection, Gaussian Mixture Model, Behavior Normalization, Candidate-based Selection, Multimodal Policy, D4RL Benchmark, Inference-time Budget
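The candidate-based selection step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `rerank_candidates`, the parameters `beta` and `lam`, and the additive way the two terms are combined are all assumptions made here for clarity.

```python
from statistics import mean, pstdev


def rerank_candidates(q_ensemble, behavior_logp, beta=1.0, lam=0.5):
    """Pick one action from a candidate set (hypothetical sketch).

    q_ensemble    -- one list of ensemble Q-values per candidate action
    behavior_logp -- behavior-model log-likelihood of each candidate
    beta, lam     -- assumed trade-off weights, not taken from the paper
    """
    # Conservative value estimate: ensemble mean minus beta * ensemble std.
    lcb = [mean(qs) - beta * pstdev(qs) for qs in q_ensemble]

    # Standardize the log-likelihoods within this state's candidate set so
    # the support term is comparable across states and candidate budgets.
    mu = mean(behavior_logp)
    sigma = pstdev(behavior_logp)
    support = [(lp - mu) / sigma if sigma > 0 else 0.0 for lp in behavior_logp]

    # Assumed combination rule: value lower bound plus weighted support.
    scores = [v + lam * s for v, s in zip(lcb, support)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Increasing the number of candidates passed in corresponds to the inference-time budget knob mentioned in the abstract; no retraining is involved in this step.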

245. ❌ Generative Inversion of Spectroscopic Data for Amorphous Structure Elucidation

Authors: Jiawei Guo, Daniel Schwalbe-Koda Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23210v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

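Scoring rationale: The paper focuses on materials science and proposes GLASS, a generative framework that inverts multi-modal spectroscopic measurements into atomistic structures. Its core is the application of a generative (score-based) model to materials science, which falls under AI for Science. However, it does not involve large language models (LLMs), innovations in deep learning technical principles, or any technique in the keyword list other than 'AI for Science'. All other keywords (such as MoE, SFT, RAG, CoT, and Agents) are entirely unrelated to the paper and are scored 0. Only 'AI for Science' is highly relevant, since the paper is explicitly an AI application in a scientific field (materials science), and it is scored 10.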

!!! tip deepseek-chat TL;DR

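This work tackles the problem of determining the atomic structure of amorphous materials from spectroscopic data. It proposes GLASS, a generative framework that inverts multi-modal spectra into realistic atomistic structures without knowledge of the potential energy surface, and applies it to uncover mechanisms behind contested experimental problems involving amorphous silicon, sulfur, and ice.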

Abstract (Chinese translation)

从表征数据中确定原子级结构是材料科学中最常见且复杂的问题之一。尤其在非晶材料中，提出既能保持真实性又与实验数据相符的结构，需要专家指导、良好的原子间势函数或两者兼备。本文介绍GLASS，一种生成式框架，能够将多模态光谱测量数据反演为真实的原子结构，而无需预先了解势能面。该框架通过基于分数的模型从低保真数据中学习结构先验，并基于可微分光谱目标采样分布外结构。利用对分布函数（PDF）、X射线吸收光谱和衍射测量进行的重构，量化了不同光谱模式之间的互补性，并证明PDF是本框架信息量最丰富的探测手段。我们运用GLASS阐释了三个存在争议的实验问题：非晶硅中的次晶态、硫中的液-液相变，以及球磨法制备的非晶冰。在每种案例中，生成的结构均能复现实验测量结果，并揭示仅靠衍射分析无法触及的机制。

Abstract

Determining atomistic structures from characterization data is one of the most common yet intricate problems in materials science. Particularly in amorphous materials, proposing structures that balance realism and agreement with experiments requires expert guidance, good interatomic potentials, or both. Here, we introduce GLASS, a generative framework that inverts multi-modal spectroscopic measurements into realistic atomistic structures without knowledge of the potential energy surface. A score-based model learns a structural prior from low-fidelity data and samples out-of-distribution structures conditioned on differentiable spectral targets. Reconstructions using pair distribution functions (PDFs), X-ray absorption spectroscopy, and diffraction measurements quantify the complementarity between spectral modalities and demonstrate that PDFs are the most informative probe for our framework. We use GLASS to rationalize three contested experimental problems: paracrystallinity in amorphous silicon, a liquid-liquid phase transition in sulfur, and ball-milled amorphous ice. In each case, generated structures reproduce experimental measurements and reveal mechanisms inaccessible to diffraction analysis alone.

Keywords: generative framework, amorphous materials, spectroscopic data inversion, atomistic structure elucidation, score-based model, pair distribution functions, materials science, GLASS

246. ❌ A One-Inclusion Graph Approach to Multi-Group Learning

Authors: Noah Bergam, Samuel Deng, Daniel Hsu Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23208v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

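Scoring rationale: The paper studies theoretical bounds on the sample complexity of multi-group learning. It belongs to machine learning theory and is entirely unrelated to all of the keywords, which concern large models, deep learning technical principles, or specific applications.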

!!! tip deepseek-chat TL;DR

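The paper studies the sample complexity of multi-group learning, proposes an algorithm based on the one-inclusion graph prediction strategy, and proves optimal convergence-rate bounds.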

Abstract (Chinese translation)

我们证明了多群体学习样本复杂度的最紧已知上界。我们的算法通过推广二分图$b$-匹配，扩展了单包含图预测策略。在群体可实现设定下，我们给出了一个下界，证实了该算法$\log n / n$的收敛速度在一般情况下是最优的。若放宽学习目标，使得评估所针对的群体是在未知样本的情况下选择的，则在群体可实现条件下，我们的算法能够达到最优的$1/n$收敛速度。

Abstract

We prove the tightest-known upper bounds on the sample complexity of multi-group learning. Our algorithm extends the one-inclusion graph prediction strategy using a generalization of bipartite $b$-matching. In the group-realizable setting, we provide a lower bound confirming that our algorithm’s $\log n / n$ convergence rate is optimal in general. If one relaxes the learning objective such that the group on which we are evaluated is chosen obliviously of the sample, then our algorithm achieves the optimal $1/n$ convergence rate under group-realizability.

Keywords: multi-group learning, sample complexity, one-inclusion graph, bipartite b-matching, convergence rate, group-realizable setting, lower bound, optimal algorithm

247. ❌ Between Resolution Collapse and Variance Inflation: Weighted Conformal Anomaly Detection in Low-Data Regimes

Authors: Oliver Hennhöfer, Christine Preisach Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23205v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

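Scoring rationale: The paper focuses on weighted conformal anomaly detection. It studies the trade-off between p-value conservativeness and variance inflation in statistical inference and proposes a solution based on continuous kernel density estimation. The work belongs to statistical machine learning and covers anomaly detection, conformal prediction, and non-stationary data, but it does not touch on anything the keywords cover, such as large language models, deep learning technical principles, or AI for Science. The paper is unrelated to every keyword, so all relevance scores are 0.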

!!! tip deepseek-chat TL;DR

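The paper addresses the trade-off between p-value conservativeness and variance inflation that local adaptation causes in weighted conformal anomaly detection. It proposes a continuous inference relaxation that decouples local adaptation from tail resolution via continuous weighted kernel density estimation, recovering statistical power while maintaining valid error control.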

Abstract (Chinese translation)

标准共形异常检测在可交换性假设下提供了有限的样本边际保证。然而，现实世界数据常呈现分布漂移，这需要通过加权共形方法来适应局部非平稳性。我们证明，这种适应引发了可达到的最小p值与其稳定性之间的关键权衡。随着重要性权重集中于相关的校准实例，有效样本量会减少。这可能导致标准共形p值对于有效的误差控制过于保守，而用于缓解此问题的平滑技术会引入条件方差，可能掩盖异常。我们提出了一种连续推断松弛方法，通过连续加权核密度估计将局部适应性与尾部解析解耦，从而解决这一困境。该方法将有限样本精确性松弛为渐近有效性，同时消除了蒙特卡洛变异性，并恢复了因离散化而损失的统计功效。实证评估证实，我们的方法不仅在离散基线方法无法发现异常的情况下恢复了检测能力，而且在统计功效上优于标准方法，同时在实践中保持了有效的边际误差控制。

Abstract

Standard conformal anomaly detection provides marginal finite-sample guarantees under the assumption of exchangeability. However, real-world data often exhibit distribution shifts, necessitating a weighted conformal approach to adapt to local non-stationarity. We show that this adaptation induces a critical trade-off between the minimum attainable p-value and its stability. As importance weights localize to relevant calibration instances, the effective sample size decreases. This can render standard conformal p-values overly conservative for effective error control, while the smoothing technique used to mitigate this issue introduces conditional variance, potentially masking anomalies. We propose a continuous inference relaxation that resolves this dilemma by decoupling local adaptation from tail resolution via continuous weighted kernel density estimation. While relaxing finite-sample exactness to asymptotic validity, our method eliminates Monte Carlo variability and recovers the statistical power lost to discretization. Empirical evaluations confirm that our approach not only restores detection capabilities where discrete baselines yield zero discoveries, but outperforms standard methods in statistical power while maintaining valid marginal error control in practice.

Keywords: conformal anomaly detection, weighted conformal, p-value stability, low-data regimes, continuous inference, kernel density estimation, statistical power, error control
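The "resolution collapse" the abstract refers to can be made concrete with a small sketch. It uses one common textbook form of the weighted conformal p-value and the Kish effective sample size; the paper's exact definitions may differ, and all function names here are illustrative assumptions.

```python
def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight=1.0):
    """Weighted conformal p-value (one common form, assumed here).

    Higher scores mean "more anomalous"; the test point counts toward its
    own tail with weight test_weight.
    """
    total = sum(cal_weights) + test_weight
    tail = sum(w for s, w in zip(cal_scores, cal_weights) if s >= test_score)
    return (tail + test_weight) / total


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def min_attainable_pvalue(cal_weights, test_weight=1.0):
    """Smallest p-value reachable, even for an extreme test score."""
    return test_weight / (sum(cal_weights) + test_weight)
```

With uniform weights over four calibration points the smallest attainable p-value is 1/5; if the weights concentrate on a single calibration point, the effective sample size drops to 1 and the smallest attainable p-value rises to 1/2, so no anomaly can be flagged at common significance levels.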

248. ❌ A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control

Authors: Louis Claeys, Artur Goldman, Zebang Shen, Niao He Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

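Scoring rationale: The paper studies high-dimensional stochastic optimal control. It proposes a method based on Schrödinger eigenfunctions for long-horizon control problems and uses neural networks to learn the eigensystem. The content belongs to control theory, mathematical physics, and computational mathematics, and is entirely unrelated to all scoring keywords, which center on large models, deep learning techniques, and their applications. The paper does not mention large models, deep learning, AI for Science, or related concepts, nor does it involve any technique or application named in the scoring keywords.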

!!! tip deepseek-chat TL;DR

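The paper proposes a Schrödinger-eigenfunction-based method for long-horizon, high-dimensional stochastic optimal control. It proves that the linear operator of the control problem is unitarily equivalent to a Schrödinger operator and designs a new loss function for learning the eigensystem. On several benchmarks it achieves an order-of-magnitude improvement in control accuracy over existing methods while reducing memory usage and runtime complexity from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.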

Abstract (Chinese translation)

高维随机最优控制问题随着规划时域$T$的延长而愈加困难：现有方法的时间复杂度随$T$线性增长，且控制性能常呈指数级恶化。针对一类具有线性可解性的随机最优控制子问题——即其无控漂移项为某势能梯度的情况，我们突破了这些限制。在此设定下，哈密顿-雅可比-贝尔曼方程可简化为由算子$\mathcal{L}$主导的线性偏微分方程。我们证明，在梯度漂移假设下，$\mathcal{L}$酉等价于具有纯离散谱的薛定谔算子$\mathcal{S} = -\Delta + \mathcal{V}$，这使得长时域控制问题可通过$\mathcal{L}$的特征系统进行高效描述。这一关联带来两个关键成果：首先，对于对称线性二次调节器问题，$\mathcal{S}$与量子谐振子的哈密顿量一致，其闭式特征系统为具有*任意*终端成本的对称LQR问题提供了解析解。其次，在更一般场景中，我们采用神经网络学习$\mathcal{L}$的特征系统。我们发现现有特征函数学习损失函数在控制任务中存在隐式重加权问题导致性能下降，并提出一种新型损失函数以缓解此问题。我们在多个长时域基准测试中评估所提方法，相比现有最优方法实现了控制精度数量级的提升，同时将内存占用与时间复杂度从$\mathcal{O}(Td)$降低至$\mathcal{O}(d)$。

Abstract

High-dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon $T$, with performance often deteriorating exponentially. We overcome these limitations for a subclass of linearly-solvable SOC problems: those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton-Jacobi-Bellman equation reduces to a linear PDE governed by an operator $\mathcal{L}$. We prove that, under the gradient drift assumption, $\mathcal{L}$ is unitarily equivalent to a Schrödinger operator $\mathcal{S} = -\Delta + \mathcal{V}$ with purely discrete spectrum, allowing the long-horizon control to be efficiently described via the eigensystem of $\mathcal{L}$. This connection provides two key results: first, for a symmetric linear-quadratic regulator (LQR), $\mathcal{S}$ matches the Hamiltonian of a quantum harmonic oscillator, whose closed-form eigensystem yields an analytic solution to the symmetric LQR with *arbitrary* terminal cost. Second, in a more general setting, we learn the eigensystem of $\mathcal{L}$ using neural networks. We identify implicit reweighting issues with existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate this. We evaluate our method on several long-horizon benchmarks, achieving an order-of-magnitude improvement in control accuracy compared to state-of-the-art methods, while reducing memory usage and runtime complexity from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.

Keywords: stochastic optimal control, long-horizon planning, Schrödinger eigenfunction, Hamilton-Jacobi-Bellman equation, neural networks, linear-quadratic regulator, eigensystem learning, computational complexity reduction
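For orientation, a standard textbook identity of the kind the abstract alludes to is sketched below. It is not taken from the paper. It assumes uncontrolled dynamics $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dW_t$ with potential $U$ (the symbols $U$, $\omega$, and $\mathcal{L}_0$ are introduced here for illustration), and it omits any running-cost term that the paper's operator $\mathcal{L}$ may contain.

$$
\mathcal{L}_0 = \Delta - \nabla U \cdot \nabla, \qquad
e^{-U/2}\,(-\mathcal{L}_0)\,e^{U/2} = -\Delta + \mathcal{V}, \qquad
\mathcal{V} = \tfrac{1}{4}\lVert \nabla U \rVert^2 - \tfrac{1}{2}\Delta U .
$$

The map $g \mapsto e^{U/2} g$ is unitary from $L^2(dx)$ to $L^2(e^{-U}dx)$, which is what makes this a unitary equivalence. For a quadratic potential $U(x) = \tfrac{\omega}{2}\lVert x \rVert^2$ in dimension $d$, the formula gives $\mathcal{V}(x) = \tfrac{\omega^2}{4}\lVert x \rVert^2 - \tfrac{\omega d}{2}$, which is a harmonic-oscillator potential up to a constant shift.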

249. ❌ DAK-UCB: Diversity-Aware Prompt Routing for LLMs and Generative Models

Authors: Donya Jafari, Farzan Farnia Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

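Scoring rationale: The paper's core topic is model selection in LLM serving, and it proposes the DAK-UCB algorithm for online selection of generative models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific technical content of the other keywords, such as MoE, SLMs, training methods, inference optimization, alignment, agent systems, quantization and compression, or AI for Science applications, so those keywords are scored 0.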

!!! tip deepseek-chat TL;DR

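The paper targets a shortcoming of fidelity-based model selection in LLM and generative-model serving, namely that it ignores output diversity. It proposes DAK-UCB, a diversity-aware contextual bandit algorithm, and experiments show that it promotes diversity-aware model selection while maintaining generation fidelity.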

Abstract (Chinese translation)

生成式人工智能与大语言模型服务的扩展凸显了对适应性机制日益增长的需求，这些机制旨在选择合适的可用模型以回应用户提示。近期研究提出了离线和在线学习框架，仅基于最大化以提示为基准的保真度评估分数（例如文本到图像生成中的CLIP分数）来为输入提示识别最优生成式AI模型。然而，这种基于保真度的选择方法忽视了生成输出的多样性，因此可能无法解决生成响应中潜在的多样性缺陷。本文提出多样性感知核化上置信界（Diversity-Aware Kernelized Upper Confidence Bound, DAK-UCB）方法作为一种上下文赌博机算法，用于在线选择生成模型时兼顾多样性考量。所提出的DAK-UCB方法将保真度与多样性相关指标共同纳入选择过程。我们基于提示感知多样性评分函数设计该框架，该函数可分解为对先前生成轮次中提示-输出对的基于双样本期望。具体而言，我们通过联合核距离与核熵度量来阐述该框架的应用。实验结果表明，DAK-UCB在促进多样性感知模型选择的同时，能在一系列提示的生成过程中保持保真度。代码发布于https://github.com/Donya-Jafari/DAK-UCB。

Abstract

The expansion of generative AI and LLM services underscores the growing need for adaptive mechanisms to select an appropriate available model to respond to a user’s prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative AI model for an input prompt, based solely on maximizing prompt-based fidelity evaluation scores, e.g., CLIP-Score in text-to-image generation. However, such fidelity-based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB) method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK-UCB method incorporates both fidelity and diversity-related metrics into the selection process. We design this framework based on prompt-aware diversity score functions that decompose to a two-sample-based expectation over prompt-output pairs in the previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK-UCB in promoting diversity-aware model selection while maintaining fidelity in the generations for a sequence of prompts. The code is available at https://github.com/Donya-Jafari/DAK-UCB.

Keywords: LLMs, generative models, prompt routing, diversity-aware selection, contextual bandit, DAK-UCB, online learning, fidelity and diversity

250. ❌ A Bayesian Learning Approach for Drone Coverage Network: A Case Study on Cardiac Arrest in Scotland

Authors: Tathagata Basu, Edoardo Patelli, Gianluca Filippi, Ben Parsonage, Christy Maddock, Massimiliano Vasile, Marco Fossati, Adam Loyd, Shaun Marshall, Paul Gowens Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23134v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

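Scoring rationale: The paper studies the optimal design of a drone-assisted automated external defibrillator (AED) delivery network, using a Bayesian learning framework to handle environmental uncertainty. It belongs to operations research and emergency medical system optimization. It does not involve large models or the principles or applications of deep learning, and it is entirely unrelated to most keywords (such as LLM, MoE, SFT, RAG, and quantization). The only marginally related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies a computational method (Bayesian learning) to a medical emergency setting (cardiac arrest). This can be viewed as a peripheral application of AI in science and medicine, but it is not the core contribution, so it receives 5 points (some relevance). All other keywords receive 0.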

!!! tip deepseek-chat TL;DR

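The paper proposes a reliability-driven Bayesian learning framework for optimizing station placement in a drone-assisted AED delivery network for cardiac arrest emergency response in Scotland. The results indicate that the network can improve coverage in both urban and rural areas and is expected to be cost-effective.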

Abstract (Chinese translation)

无人机正日益成为紧急医疗服务系统的补充性装备而受到关注。尽管多项试点研究和飞行试验已证明无人机辅助自动体外除颤器配送的可行性，但由于高昂的资本支出和环境不确定性，运营大规模实际网络仍面临挑战。本文构建了一个考虑可靠性的贝叶斯学习框架，用于在环境与运行不确定性的条件下设计无人机辅助的自动体外除颤器配送网络。我们基于院外心脏骤停患者的生存概率提出目标函数，以确定无人机基地的理想布点位置。此外，我们综合考虑现有紧急医疗服务基础设施的覆盖范围，以提升偏远地区的应急响应可靠性。我们利用苏格兰地理参照的心脏骤停数据对所提方法进行实证阐释。结果表明环境变异性和空间需求模式如何影响城乡区域最优无人机站点的布局。此外，我们通过网络稳健性评估，并采用基于预期质量调整生命年的成本效益分析来评价其经济可行性。研究结果表明，无人机辅助的自动体外除颤器配送具有成本效益预期，并有望显著改善救护车响应时间较长的农村与城市区域的紧急响应覆盖范围。

Abstract

Drones are becoming popular as a complementary system for emergency medical services (EMS). Although several pilot studies and flight trials have shown the feasibility of drone-assisted automated external defibrillator (AED) delivery, running a full-scale operational network remains challenging due to high capital expenditure and environmental uncertainties. In this paper, we formulate a reliability-informed Bayesian learning framework for designing drone-assisted AED delivery networks under environmental and operational uncertainty. We propose our objective function based on the survival probability of out-of-hospital cardiac arrest (OHCA) patients to identify the ideal locations of drone stations. Moreover, we consider the coverage of existing EMS infrastructure to improve the response reliability in remote areas. We illustrate our proposed method using geographically referenced cardiac arrest data from Scotland. The result shows how environmental variability and spatial demand patterns influence optimal drone station placement across urban and rural regions. In addition, we assess the robustness of the network and evaluate its economic viability using a cost-effectiveness analysis based on expected quality-adjusted life years (QALYs). The findings suggest that drone-assisted AED delivery is expected to be cost-effective and has the potential to significantly improve the emergency response coverage in rural and urban areas with longer ambulance response times.

Keywords: drone-assisted AED delivery, Bayesian learning framework, cardiac arrest emergency response, network reliability optimization, spatial demand patterns, cost-effectiveness analysis, geographic data analysis, EMS infrastructure coverage

251. ❌ Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Authors: Aditya Kakade, Vivek Srivastava, Shirish Karande Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

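Scoring rationale: The paper's core contribution is Polaris, a Gödel agent framework focused on autonomous policy repair and improvement for small language models (SLMs), so it is highly relevant to 'Small Language Models' (10 points). It involves agent self-improvement and policy correction, which makes it highly relevant to 'Self-Correction' and 'LLM Agents' (10 points each). It mentions that the agent performs meta-reasoning and explains its errors, which relates somewhat to 'Chain of Thought', 'System 2 Thinking', and 'Explainable AI' (5 points each). It uses a 7B-parameter model, which falls within the scope of large models, so it is relevant to 'Large Language Models' (8 points). Other keywords, such as MoE, Scaling Laws, training methods, RAG, and inference acceleration, are not mentioned in the abstract and are scored 0.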

!!! tip deepseek-chat TL;DR

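The paper proposes the Polaris framework, which realizes recursive self-improvement of a Gödel agent for small language models through experience-abstracted policy repair, and consistently improves the performance of a 7B-parameter model on several reasoning and problem-solving benchmarks.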

Abstract (Chinese translation)

哥德尔智能体实现递归式自我改进：该智能体通过检验自身策略与执行轨迹，在测试循环中修改其策略。本文提出Polaris——一种面向紧凑模型的哥德尔智能体，它通过经验抽象进行策略修复，将失败转化为策略更新，这一过程包含分析、策略构建、抽象化以及通过保守性检查的最小化代码补丁修复的结构化循环。与响应层面的自我校正或参数调优不同，Polaris在策略层面进行修改，生成可审计的小型补丁，这些补丁持久化嵌入策略中，并在各基准测试的未见实例上重复使用。作为循环的一部分，该智能体进行元推理：解释自身错误，提出对策略的具体修订方案，继而更新策略。为实现累积式策略优化，我们引入经验抽象技术，将失败案例提炼为可迁移至未见实例的紧凑、可复用策略。在涵盖算术推理、组合推理、研究生水平问题求解及创意写作评估的MGSM、DROP、GPQA和LitBench基准测试中，搭载Polaris的70亿参数模型相较于基础策略及竞争基线模型均取得持续性能提升。

Abstract

Gödel agents realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks. Unlike response-level self-correction or parameter tuning, Polaris makes policy-level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta-reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.

Keywords: Gödel agent, Small Language Models, Policy Repair, Self-improvement, Experience Abstraction, Meta Reasoning, 7-billion-parameter model, Benchmark Evaluation

252. ❌ High-Resolution Tensor-Network Fourier Methods for Exponentially Compressed Non-Gaussian Aggregate Distributions

Authors: Juan José Rodríguez-Aldavero, Juan José García-Ripoll Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

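Scoring rationale: The paper studies a mathematical and computational technique that uses quantized tensor-network (QTT/MPS) methods to exponentially compress the probability distribution of weighted sums of independent random variables, with application to financial risk computation (VaR/ES). Its content lies entirely within mathematics, computational science, and financial engineering, and does not involve large language models, deep learning, AI technical principles, or AI applications in science. Every keyword concerns large models, deep learning, AI techniques, or AI for Science, so none is relevant to this paper and all are scored 0.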

!!! tip deepseek-chat TL;DR

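The paper proposes a high-resolution Fourier method based on quantized tensor networks (QTT/MPS) that exponentially compresses non-Gaussian aggregate distributions, enabling efficient computation of financial risk measures (VaR and ES).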

Abstract (Chinese translation)

独立随机变量加权和的特征函数在量化张量链（QTT）表示——亦称矩阵乘积态（MPS）——中展现出低秩结构，从而能对其完全非高斯概率分布实现高达指数级的压缩。在变量独立的条件下，全局特征函数可分解为局部项的乘积。其低秩QTT结构源于连续模型中固有的频谱平滑性，或源于离散模型中随着分量数 $D$ 增加而产生的频谱能量集中现象。我们以伯努利随机变量与对数正态随机变量的加权和为例验证了这一点。对于前者，尽管在分量数较少（$D$ 较小）时存在不利的、不可压缩的情形，但当分量数 $D \gtrsim 300$ 时，特征函数会发生急剧的键维数坍缩，从而实现多对数时间与内存复杂度。对于后者，该方法在标准硬件上可达到 $N = 2^{30}$ 个频率模式的高分辨率离散化，远超稠密实现 $N = 2^{24}$ 的上限。这些压缩表示使得风险价值（VaR）与预期缺口（ES）的高效计算成为可能，可支持量化金融及其他领域的应用。

Abstract

Characteristic functions of weighted sums of independent random variables exhibit low-rank structure in the quantized tensor train (QTT) representation, also known as matrix product states (MPS), enabling up to exponential compression of their fully non-Gaussian probability distributions. Under variable independence, the global characteristic function factorizes into local terms. Its low-rank QTT structure arises from intrinsic spectral smoothness in continuous models, or from spectral energy concentration as the number of components $D$ grows in discrete models. We demonstrate this on weighted sums of Bernoulli and lognormal random variables. In the former, despite an adversarial, incompressible small-$D$ regime, the characteristic function undergoes a sharp bond-dimension collapse for $D \gtrsim 300$ components, enabling polylogarithmic time and memory scaling. In the latter, the approach reaches high-resolution discretizations of $N = 2^{30}$ frequency modes on standard hardware, far beyond the $N = 2^{24}$ ceiling of dense implementations. These compressed representations enable efficient computation of Value at Risk (VaR) and Expected Shortfall (ES), supporting applications in quantitative finance and beyond.
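
As a rough illustration of the factorization the abstract describes, the sketch below evaluates the characteristic function of a weighted Bernoulli sum as a product of local terms, inverts it with a dense inverse DFT, and reads off VaR and ES. It is a minimal dense example in plain Python, not the paper's QTT/MPS method: nothing is compressed here. The function names, the restriction to non-negative integer weights, the grid size `n_modes`, and the convention that the sum is a loss are all assumptions made for illustration.

```python
import cmath
import math


def char_fn_bernoulli_sum(weights, probs, n_modes):
    """Sample phi(t_k) of S = sum_j w_j X_j on the grid t_k = 2*pi*k/n_modes."""
    values = []
    for k in range(n_modes):
        t = 2.0 * math.pi * k / n_modes
        phi = 1.0 + 0.0j
        # Independence: the global characteristic function is the product
        # of the local Bernoulli characteristic functions.
        for w, p in zip(weights, probs):
            phi *= (1.0 - p) + p * cmath.exp(1j * w * t)
        values.append(phi)
    return values


def pmf_from_char_fn(phi):
    """Recover P(S = s) with a dense O(N^2) inverse DFT (no compression)."""
    n = len(phi)
    pmf = []
    for s in range(n):
        acc = sum(phi[k] * cmath.exp(-2j * math.pi * k * s / n) for k in range(n))
        pmf.append(max(acc.real / n, 0.0))  # clip tiny negative round-off
    return pmf


def var_es(pmf, alpha):
    """VaR and ES at level alpha for a loss distributed on 0..len(pmf)-1."""
    cdf = 0.0
    var = len(pmf) - 1
    for s, p in enumerate(pmf):
        cdf += p
        if cdf >= alpha - 1e-12:
            var = s
            break
    # Average of the upper (1 - alpha) tail, with the atom at VaR split
    # so that exactly (1 - alpha) probability mass is used.
    tail = sum(s * p for s, p in enumerate(pmf) if s > var)
    es = (tail + var * (cdf - alpha)) / (1.0 - alpha)
    return var, es
```

A call such as `var_es(pmf_from_char_fn(char_fn_bernoulli_sum([1, 2], [0.5, 0.5], 4)), 0.75)` treats a sum that is uniform on {0, 1, 2, 3}. `n_modes` must exceed the largest attainable value of the sum, otherwise the recovered distribution wraps around. The dense inversion costs O(N^2) in this sketch, which is the kind of scaling the compressed representation in the paper is meant to avoid.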

Keywords: tensor network, quantized tensor train, matrix product states, characteristic function, exponential compression, Value at Risk, Expected Shortfall, quantitative finance

253. ❌ SpecXMaster Technical Report

Authors: Yutang Ge, Yaning Cui, Hanzheng Li, Jun-Jie Wang, Fanjie Xu, Jinhan Dong, Yongqi Jin, Dongxu Cui, Peng Jin, Guojiang Zhao, Hengxing Cai, Rong Zhu, Linfeng Zhang, Xiaohong Ji, Zhifeng Gao Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

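Scoring rationale: The paper develops SpecXMaster, an intelligent framework for interpreting nuclear magnetic resonance (NMR) molecular spectra, built on Agentic Reinforcement Learning (RL). The work is an AI for Science (AI4Science) application in chemistry and bioinformatics, specifically the automated interpretation of spectral data. It is therefore highly relevant only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10), because it directly applies AI to a scientific problem in chemical spectroscopy. All other keywords concern specific topics such as large models, deep learning principles, training methods, inference optimization, and agent systems, which the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SpecXMaster, an intelligent framework that uses Agentic Reinforcement Learning to fully automate spectral interpretation from raw NMR data to chemical structures. It targets the human bias, dependence on specialized expertise, and inter-interpreter variability of conventional interpretation, and reports superior performance on multiple public benchmarks.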

Abstract

Intelligent spectroscopy serves as a pivotal element in AI-driven closed-loop scientific discovery, functioning as the critical bridge between matter structure and artificial intelligence. However, conventional expert-dependent spectral interpretation encounters substantial hurdles, including susceptibility to human bias and error, dependence on limited specialized expertise, and variability across interpreters. To address these challenges, we propose SpecXMaster, an intelligent framework leveraging Agentic Reinforcement Learning (RL) for NMR molecular spectral interpretation. SpecXMaster enables automated extraction of multiplicity information from both 1H and 13C spectra directly from raw FID (free induction decay) data. This end-to-end pipeline enables fully automated interpretation of NMR spectra into chemical structures. It demonstrates superior performance across multiple public NMR interpretation benchmarks and has been refined through iterative evaluations by professional chemical spectroscopists. We believe that SpecXMaster, as a novel methodological paradigm for spectral interpretation, will have a profound impact on the organic chemistry community.

Keywords: Intelligent spectroscopy, NMR spectral interpretation, Agentic Reinforcement Learning, Automated extraction, Chemical structures, AI-driven scientific discovery, End-to-end pipeline, Organic chemistry

254. ❌ Generalization Bounds for Physics-Informed Neural Networks for the Incompressible Navier-Stokes Equations

Authors: Sebastien Andre-Sloan, Dibyakanti Kumar, Alejandro F Frangi, Anirbit Mukherjee Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23072v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

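Scoring rationale: The paper studies upper bounds on the generalization error of physics-informed neural networks (PINNs) for solving the incompressible Navier-Stokes equations, an application of AI to scientific computing (fluid mechanics). It does not touch on large language models (LLMs), innovations in deep learning techniques (such as MoE, quantization, or inference acceleration), alignment training, or agent systems. It is only partly related to "AI for Science": PINNs apply AI to a scientific field (computational fluid dynamics), but the paper does not involve bioinformatics or cheminformatics, and its contribution is a theoretical generalization bound rather than an AI technique itself, hence a relevance of 5. All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper establishes the first generalization error upper bounds for solving the incompressible Navier-Stokes equations with physics-informed neural networks (PINNs), and validates the bounds and the suggested novel activation functions through theoretical analysis and experiments.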

Abstract

This work establishes rigorous first-of-its-kind upper bounds on the generalization error for the method of approximating solutions to the (d+1)-dimensional incompressible Navier-Stokes equations by training depth-2 neural networks trained via the unsupervised Physics-Informed Neural Network (PINN) framework. This is achieved by bounding the Rademacher complexity of the PINN risk. For appropriately weight bounded net classes our derived generalization bounds do not explicitly depend on the network width and our framework characterizes the generalization gap in terms of the fluid’s kinematic viscosity and loss regularization parameters. In particular, the resulting sample complexity bounds are dimension-independent. Our generalization bounds suggest using novel activation functions for solving fluid dynamics. We provide empirical validation of the suggested activation functions and the corresponding bounds on a PINN setup solving the Taylor-Green vortex benchmark.
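
To make the benchmark mentioned in the abstract more concrete, the sketch below evaluates the classical two-dimensional Taylor-Green vortex and checks, by finite differences, that it satisfies incompressibility and the x-momentum equation. These are the kinds of PDE residuals a PINN loss penalizes. The 2D unit-density form, the function names, and the use of finite differences in place of automatic differentiation are assumptions for illustration. The paper's exact setup is not given in this report.

```python
import math


def taylor_green(x, y, t, nu):
    """2D Taylor-Green vortex (unit density): velocity (u, v) and pressure p."""
    decay = math.exp(-2.0 * nu * t)
    u = math.sin(x) * math.cos(y) * decay
    v = -math.cos(x) * math.sin(y) * decay
    p = 0.25 * (math.cos(2.0 * x) + math.cos(2.0 * y)) * decay ** 2
    return u, v, p


def pde_residuals(x, y, t, nu, h=1e-3):
    """Central-difference residuals of incompressibility and x-momentum."""

    def field(dx=0.0, dy=0.0, dt=0.0):
        return taylor_green(x + dx, y + dy, t + dt, nu)

    u, v, _ = field()
    u_x = (field(dx=h)[0] - field(dx=-h)[0]) / (2.0 * h)
    u_y = (field(dy=h)[0] - field(dy=-h)[0]) / (2.0 * h)
    v_y = (field(dy=h)[1] - field(dy=-h)[1]) / (2.0 * h)
    u_t = (field(dt=h)[0] - field(dt=-h)[0]) / (2.0 * h)
    p_x = (field(dx=h)[2] - field(dx=-h)[2]) / (2.0 * h)
    lap_u = (field(dx=h)[0] + field(dx=-h)[0]
             + field(dy=h)[0] + field(dy=-h)[0] - 4.0 * u) / h ** 2

    # Incompressibility: div(velocity) = 0.
    divergence = u_x + v_y
    # x-momentum: u_t + u*u_x + v*u_y + p_x - nu * laplacian(u) = 0.
    momentum_x = u_t + u * u_x + v * u_y + p_x - nu * lap_u
    return divergence, momentum_x
```

Both residuals should be close to zero at any point, for example `pde_residuals(0.3, 1.1, 0.5, 0.01)`, up to finite-difference error. In a PINN, the same residuals would be computed on the network output and minimized over sampled collocation points.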

Keywords: Physics-Informed Neural Networks, PINNs, Navier-Stokes equations, generalization bounds, Rademacher complexity, fluid dynamics, Taylor-Green vortex, activation functions

255. ❌ MsFormer: Enabling Robust Predictive Maintenance Services for Industrial Devices

Authors: Jiahui Zhou, Dan Li, Ruibing Jin, Jian Lou, Yanran Zhao, Zhenghua Chen, Zigui Jiang, See-Kiong Ng Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23076v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

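Scoring rationale: The paper focuses on an AI service framework for industrial predictive maintenance and proposes MsFormer, a lightweight multi-scale Transformer for industrial IoT sensor data. The keywords all concern large language models (LLMs), whereas this paper studies a domain-specific Transformer application, not LLMs. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since industrial AI can be viewed as AI applied to science and engineering, but it is not a core match, hence a relevance of 5. Other keywords such as MoE, Scaling Laws, and RLHF concern LLM technical principles or applications and are unrelated to this paper.

!!! tip deepseek-chat TL;DR

The paper proposes MsFormer, a lightweight multi-scale Transformer that addresses two challenges in industrial predictive maintenance: multi-scale temporal sensor data and small datasets. On real-world datasets it outperforms existing methods and shows strong generalization.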

Abstract

Providing reliable predictive maintenance is a critical industrial AI service essential for ensuring the high availability of manufacturing devices. Existing deep-learning methods present competitive results on such tasks but lack a general service-oriented framework to capture complex dependencies in industrial IoT sensor data. While Transformer-based models show strong sequence modeling capabilities, their direct deployment as robust AI services faces significant bottlenecks. Specifically, streaming sensor data collected in real-world service environments often exhibits multi-scale temporal correlations driven by machine working principles. Besides, the datasets available for training time-to-failure predictive services are typically limited in size. These issues pose significant challenges for directly applying existing models as robust predictive services. To address these challenges, we propose MsFormer, a lightweight Multi-scale Transformer designed as a unified AI service model for reliable industrial predictive maintenance. MsFormer incorporates a Multi-scale Sampling (MS) module and a tailored position encoding mechanism to capture sequential correlations across multi-streaming service data. Additionally, to accommodate data-scarce service environments, MsFormer adopts a lightweight attention mechanism with straightforward pooling operations instead of self-attention. Extensive experiments on real-world datasets demonstrate that the proposed framework achieves significant performance improvements over state-of-the-art methods. Furthermore, MsFormer outperforms across industrial devices and operating conditions, demonstrating strong generalizability while maintaining a highly reliable Quality of Service (QoS).
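
The two ideas named in the abstract, sampling a sensor stream at several time scales and mixing tokens with pooling instead of self-attention, can be sketched roughly as follows. This is a generic illustration, not MsFormer's actual modules: the stride set, the window size, and both function names are hypothetical, and the real model operates on learned multi-channel embeddings rather than a list of floats.

```python
def multi_scale_sample(series, strides=(1, 2, 4)):
    """Subsample one sensor stream at several strides (coarser = longer scale)."""
    return {stride: series[::stride] for stride in strides}


def average_pool_mix(tokens, window=3):
    """Mix each token with its neighbours by average pooling.

    Unlike self-attention, this has no learned attention weights and costs
    O(len(tokens) * window) rather than O(len(tokens) ** 2).
    """
    half = window // 2
    mixed = []
    for i in range(len(tokens)):
        lo = max(0, i - half)
        hi = min(len(tokens), i + half + 1)
        mixed.append(sum(tokens[lo:hi]) / (hi - lo))
    return mixed
```

One plausible motivation for a parameter-free mixer of this kind is that it leaves fewer weights to fit on small time-to-failure datasets. That reading is an interpretation of the abstract, not a claim taken from the paper.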

Keywords: predictive maintenance, industrial AI, Transformer, multi-scale temporal correlations, lightweight attention, sensor data, time-to-failure prediction, QoS

256. ❌ Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Authors: Saurabh Kataria, Xiao Hu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

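Scoring rationale: The paper studies audio-language models (ALMs) for speech emotion recognition (SER) and proposes ZS-Fuse, a zero-shot late-fusion method that combines ALMs with domain-specialist foundation models (FMs). It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (relevance 8), because ALMs are a kind of foundation model and the paper repeatedly refers to Foundation Models (FMs) and examines their use. It is partly related to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5), because SER is an application of AI in science (at the intersection of psychology and computational linguistics), although the paper does not explicitly involve bioinformatics or cheminformatics. The other keywords (such as MoE, SFT, and RAG) are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a zero-shot late-fusion method (ZS-Fuse) that combines audio-language models with specialist foundation models can improve speech emotion recognition, proposes prompt amplification to strengthen zero-shot capability, and reports improvements over existing baselines on three datasets.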

Abstract

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
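
The prompt ensemble and late fusion described in the abstract can be pictured with the minimal sketch below. It assumes the zero-shot ALM yields one logit vector per prompt template and the specialist model yields class probabilities. The function names, the convex-combination fusion rule, and the weight `lam` are hypothetical and are not taken from the paper. Prompt amplification is not modelled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_ensemble(per_prompt_logits):
    """Average the class probabilities obtained from several prompt templates."""
    probs = [softmax(logits) for logits in per_prompt_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


def late_fuse(zero_shot_probs, specialist_probs, lam=0.5):
    """Fuse at the decision level: convex combination of two probability vectors."""
    fused = [lam * z + (1.0 - lam) * s
             for z, s in zip(zero_shot_probs, specialist_probs)]
    total = sum(fused)
    return [f / total for f in fused]
```

Late fusion of this kind only needs the two models' output scores, so neither model has to be retrained or to expose its internal features. How the fusion weight would be chosen is left open here.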

Keywords: Audio-Language Models, Speech Emotion Recognition, Zero-shot Learning, Late Fusion, Foundation Models, Prompt Amplification, Dual-encoder ALMs, State-of-the-art Performance

257. ❌ Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions

Authors: Adrián Detavernier, Jasper De Bock Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

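Scoring rationale: The paper studies methods for assessing the reliability of classifier predictions (Robustness Quantification and Uncertainty Quantification), which belongs to model evaluation and reliability analysis in classical machine learning. It does not involve large language models, deep learning principles, large-model applications, or any specific AI for Science technique, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper compares two methods for assessing the reliability of classifier predictions, Robustness Quantification (RQ) and Uncertainty Quantification (UQ). It finds that RQ can outperform UQ both in a standard setting and under distribution shift, and that combining the two improves the assessment further.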

Abstract

We consider two approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We explain the conceptual differences between the two approaches, compare both approaches on a number of benchmark datasets and show that RQ is capable of outperforming UQ, both in a standard setting and in the presence of distribution shift. Beside showing that RQ can be competitive with UQ, we also demonstrate the complementarity of RQ and UQ by showing that a combination of both approaches can lead to even better reliability assessments.

Keywords: Robustness Quantification, Uncertainty Quantification, classifier predictions, reliability assessment, distribution shift, benchmark datasets, complementarity

258. ❌ Post-Selection Distributional Model Evaluation

Authors: Amirmohammad Farzaneh, Osvaldo Simeone Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

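Scoring rationale: The paper proposes PS-DME, a general statistical framework for distributional model evaluation after data-dependent model pre-selection that controls the post-selection false coverage rate. Although the paper is not specifically about large-model techniques, its experiments explicitly apply the method to "text-to-SQL decoding with large language models", so it is partly related to the "Large Language Models" keyword (relevance 5). The paper is mainly about statistical evaluation methodology rather than large-model principles, training methods, inference optimization, alignment, or applications, so it is unrelated to all other keywords (0).

!!! tip deepseek-chat TL;DR

The paper proposes the PS-DME framework, which addresses reliable estimation of test-time KPI distributions after data-dependent model pre-selection. It controls the post-selection false coverage rate with an e-value-based method and is validated on synthetic data, text-to-SQL decoding with large language models, and telecom network evaluation.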

Abstract

Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring the reliable estimate of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and is proved to be more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance–reliability trade-offs.

Keywords: post-selection distributional model evaluation, model evaluation, statistical framework, false coverage rate, e-values, KPI distributions, text-to-SQL decoding, large language models

259. ❌ A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks

Authors: Najeeb Jebreel, David Sánchez, Josep Domingo-Ferrer Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

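Scoring rationale: The paper focuses on evaluating membership inference attacks (MIAs) in machine learning privacy and security. It analyzes privacy threats to general machine learning models and does not involve large models, innovations in deep learning principles, or applications in science. Every keyword concerns large-model techniques, deep learning innovation, or AI-for-science applications, whereas this paper addresses a classical machine learning privacy problem, so all keywords have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper critically evaluates the effectiveness of membership inference attacks under realistic conditions and finds that they typically pose weak privacy threats. Over-reliance on them as a privacy metric can lead to overestimated risk and unnecessary sacrifices in model utility.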

Abstract

Membership inference attacks (MIAs) aim to determine whether a data sample was included in a machine learning (ML) model’s training set and have become the de facto standard for measuring privacy leakages in ML. We propose an evaluation framework that defines the conditions under which MIAs constitute a genuine privacy threat, and review representative MIAs against it. We find that, under the realistic conditions defined in our framework, MIAs represent weak privacy threats. Thus, relying on them as a privacy metric in ML can lead to an overestimation of risk and to unnecessary sacrifices in model utility as a consequence of employing too strong defenses.

Keywords: Membership Inference Attacks, Privacy Threats, Machine Learning, Evaluation Framework, Privacy Leakages, Model Utility, Risk Assessment

260. ❌ A PAC-Bayesian approach to generalization for quantum models

Authors: Pablo Rodriguez-Grasa, Matthias C. Caro, Jens Eisert, Elies Gil-Fuster, Franz J. Schreiber, Carlos Bravo-Prieto Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22964v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

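Scoring rationale: The paper studies generalization theory for quantum machine learning models, using a PAC-Bayesian approach to analyze generalization bounds for quantum circuits, at the intersection of quantum computing and machine learning. All scoring keywords target large language models (LLMs) and deep learning techniques, covering specific directions such as model architecture, training methods, inference optimization, alignment, and applications. The paper does not involve any large language model, deep learning technique, or related application, and is unrelated to every keyword, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper presents the first PAC-Bayesian generalization bounds for quantum machine learning models. By analyzing quantum circuits that include dissipative operations, it establishes non-uniform generalization guarantees that depend on the norms of the learned parameter matrices, providing a finer-grained theoretical tool for quantum machine learning.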

Abstract

Generalization is a central concept in machine learning theory, yet for quantum models, it is predominantly analyzed through uniform bounds that depend on a model’s overall capacity rather than the specific function learned. These capacity-based uniform bounds are often too loose and entirely insensitive to the actual training and learning process. Previous theoretical guarantees have failed to provide non-uniform, data-dependent bounds that reflect the specific properties of the learned solution rather than the worst-case behavior of the entire hypothesis class. To address this limitation, we derive the first PAC-Bayesian generalization bounds for a broad class of quantum models by analyzing layered circuits composed of general quantum channels, which include dissipative operations such as mid-circuit measurements and feedforward. Through a channel perturbation analysis, we establish non-uniform bounds that depend on the norms of learned parameter matrices; we extend these results to symmetry-constrained equivariant quantum models; and we validate our theoretical framework with numerical experiments. This work provides actionable model design insights and establishes a foundational tool for a more nuanced understanding of generalization in quantum machine learning.

Keywords: quantum machine learning, PAC-Bayesian generalization, quantum models, non-uniform bounds, channel perturbation analysis, equivariant quantum models, layered quantum circuits, generalization bounds
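
As background for the kind of bound the abstract refers to, the sketch below evaluates a classical PAC-Bayes bound (the McAllester form with Maurer's constant) for a loss bounded in [0, 1]. This is the generic textbook bound, not the quantum-specific bound derived in the paper; the function name `pac_bayes_bound` and the example numbers are illustrative assumptions.

```python
import math


def pac_bayes_bound(empirical_risk: float, kl: float, n: int, delta: float) -> float:
    """Classical PAC-Bayes upper bound on the expected risk of a posterior Q.

    With probability at least 1 - delta over an i.i.d. sample of size n (n >= 8),
    for a loss bounded in [0, 1]:
        risk(Q) <= empirical_risk(Q) + sqrt((KL(Q || P) + ln(2 * sqrt(n) / delta)) / (2 * n))
    """
    if n < 8:
        raise ValueError("this form of the bound assumes n >= 8")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if kl < 0.0:
        raise ValueError("a KL divergence cannot be negative")
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return empirical_risk + math.sqrt(complexity)


# Hypothetical numbers: same training error, two different posteriors.
n = 10_000
tight = pac_bayes_bound(empirical_risk=0.05, kl=2.0, n=n, delta=0.05)
loose = pac_bayes_bound(empirical_risk=0.05, kl=200.0, n=n, delta=0.05)
print(f"small KL: {tight:.3f}, large KL: {loose:.3f}")
```

The bound depends on the KL term of the particular posterior that was learned, which is what makes PAC-Bayes guarantees non-uniform: two solutions with the same training error can receive different guarantees.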

261. ❌ Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

Authors: Anand Jerry George, Nicolas Macris Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22962v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

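Scoring rationale: The paper studies the theoretical learning behavior of diffusion models, specifically when the score function is parameterized by a random feature neural network and the data distribution lies on a low-dimensional manifold. All scoring keywords relate to large language models (LLMs) and associated techniques (such as fine-tuning, alignment, reasoning, and agents) or to specific scientific AI applications (such as bioinformatics). The paper focuses on theoretical analysis of diffusion models and does not touch on any LLM technique, application, or related concept, so the relevance of every keyword is 0.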

!!! tip deepseek-chat TL;DR

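The paper studies the theoretical learning behavior of diffusion models (denoising score matching) parameterized by a random feature neural network when the data distribution lies on a low-dimensional manifold. It finds that, for linear manifolds, the sample complexity of learning the score function scales linearly with the manifold's intrinsic dimension rather than the ambient dimension, but that this benefit of low-dimensional structure diminishes for non-linear manifolds.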

Abstract

We study the theoretical behavior of denoising score matching–the learning task associated to diffusion models–when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure starts to diminish once we have a non-linear manifold. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.

Keywords: diffusion models, denoising score matching, random feature neural network, low-dimensional manifold, sample complexity, intrinsic dimension, asymptotic analysis, theoretical learning curves
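
To make the denoising score matching objective named in the abstract concrete, here is a deliberately tiny sketch. It is not the paper's setting: instead of a random feature network on manifold data, it fits a one-parameter linear score `s(x) = theta * x` to 1-D standard Gaussian data, where the correct answer is known in closed form. The function name, sample size, and noise level are assumptions chosen for illustration.

```python
import random


def fit_linear_score(n_samples: int = 20000, sigma: float = 0.5, seed: int = 0) -> float:
    """Fit s(x) = theta * x by denoising score matching on 1-D N(0, 1) data."""
    rng = random.Random(seed)
    num = 0.0  # accumulates x_noisy * target
    den = 0.0  # accumulates x_noisy ** 2
    for _ in range(n_samples):
        x_clean = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        x_noisy = x_clean + sigma * eps
        # DSM regression target: the score of the Gaussian noising kernel,
        # d/dx_noisy log q(x_noisy | x_clean) = -(x_noisy - x_clean) / sigma^2 = -eps / sigma.
        target = -eps / sigma
        num += x_noisy * target
        den += x_noisy * x_noisy
    # Closed-form least-squares minimizer of sum (theta * x_noisy - target)^2.
    return num / den


sigma = 0.5
theta_hat = fit_linear_score(sigma=sigma)
# The noised data are N(0, 1 + sigma^2), whose score is -x / (1 + sigma^2).
theta_true = -1.0 / (1.0 + sigma ** 2)
print(f"fitted slope {theta_hat:.3f}, exact slope {theta_true:.3f}")
```

Regressing onto the noise-based target recovers the score of the noised distribution without ever evaluating that distribution's density, which is the property the learning task relies on.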

262. ❌ Stepwise Variational Inference with Vine Copulas

Authors: Elisabeth Griesbauer, Leiv Rønneberg, Arnoldo Frigessi, Claudia Czado, Ingrid Hobæk Haff Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22959v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

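Scoring rationale: This paper focuses on methodological innovation in variational inference (VI) and vine copulas, proposing a general VI procedure that estimates the variational parameters stepwise. Its content belongs entirely to statistical machine learning, involving classical statistical tools such as variational inference, copula theory, Kullback-Leibler divergence, and Rényi divergence, and has no direct connection to the scoring keywords, which all revolve around large models, deep learning techniques, and their applications. The paper does not mention large models, deep learning, AI applications, or related technical concepts, so the relevance of every keyword is 0.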

!!! tip deepseek-chat TL;DR

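The paper proposes a general variational inference method that combines vine copulas with stepwise estimation. Using an evidence lower bound based on the Rényi divergence and an intuitive stopping criterion, it interpolates between mean-field variational inference and full latent dependence while remaining parsimonious in parameters, and it outperforms mean-field variational inference in applications such as sparse Gaussian processes.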

Abstract

We propose stepwise variational inference (VI) with vine copulas: a universal VI procedure that combines vine copulas with a novel stepwise estimation procedure of the variational parameters. Vine copulas consist of a nested sequence of trees built from copulas, where more complex latent dependence can be modeled with increasing number of trees. We propose to estimate the vine copula approximate posterior in a stepwise fashion, tree by tree along the vine structure. Further, we show that the usual backward Kullback-Leibler divergence cannot recover the correct parameters in the vine copula model, thus the evidence lower bound is defined based on the Rényi divergence. Finally, an intuitive stopping criterion for adding further trees to the vine eliminates the need to pre-define a complexity parameter of the variational distribution, as required for most other approaches. Thus, our method interpolates between mean-field VI (MFVI) and full latent dependence. In many applications, in particular sparse Gaussian processes, our method is parsimonious with parameters, while outperforming MFVI.

Keywords: variational inference, vine copulas, stepwise estimation, Rényi divergence, evidence lower bound, sparse Gaussian processes, mean-field VI, latent dependence
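
The abstract replaces the Kullback-Leibler divergence with the Rényi divergence in the evidence lower bound. As a reminder of how the two relate, the sketch below computes both for discrete distributions and shows numerically that the Rényi divergence approaches KL as the order `alpha` tends to 1. The distributions and function names are illustrative assumptions; the paper's order and the direction of its divergence are not stated in this report.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def renyi_divergence(p: Sequence[float], q: Sequence[float], alpha: float) -> float:
    """Renyi divergence of order alpha (alpha > 0, alpha != 1):
    D_alpha(p || q) = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1).
    """
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0.0)
    return math.log(total) / (alpha - 1.0)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

kl = kl_divergence(p, q)
near_one = renyi_divergence(p, q, alpha=0.999)
half = renyi_divergence(p, q, alpha=0.5)
two = renyi_divergence(p, q, alpha=2.0)
print(f"KL={kl:.4f}  D_0.999={near_one:.4f}  D_0.5={half:.4f}  D_2={two:.4f}")
```

Because the divergence is non-decreasing in `alpha`, the order acts as a knob that trades off how strongly mismatches between the two distributions are penalized.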

263. ❌ Privacy-Preserving EHR Data Transformation via Geometric Operators: A Human-AI Co-Design Technical Report

Authors: Maolin Wang, Beining Bao, Gan Yuan, Hongyu Chen, Bingkun Zhao, Baoshuo Kan, Jiming Xu, Qi Shi, Yinggong Zhao, Yao Wang, Wei Ying Ma, Jun Yan Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

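Scoring rationale: The paper studies a privacy-preserving data transformation framework for electronic health records (EHRs), achieving data de-identification through geometric operators and human-AI co-design. Its content is unrelated to the vast majority of keywords (which concern large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It is only partly related to the last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”: the paper involves medical AI and biomedical data (EHRs), an application of AI in science (biomedicine), but this is not its core contribution, which is the privacy-preserving data transformation framework.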

!!! tip deepseek-chat TL;DR

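The paper proposes a privacy-preserving framework, based on geometric operators and human-AI co-design, for transforming electronic health record data. It protects patient privacy while preserving medical semantics and statistical properties, and it validates its effectiveness against reconstruction, record linkage, membership inference, and attribute inference attacks through theoretical analysis and experiments.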

Abstract

Electronic health records (EHRs) and other real-world clinical data are essential for clinical research, medical artificial intelligence, and life science, but their sharing is severely limited by privacy, governance, and interoperability constraints. These barriers create persistent data silos that hinder multi-center studies, large-scale model development, and broader biomedical discovery. Existing privacy-preserving approaches, including multi-party computation and related cryptographic techniques, provide strong protection but often introduce substantial computational overhead, reducing the efficiency of large-scale machine learning and foundation-model training. In addition, many such methods make data usable for restricted computation while leaving them effectively invisible to clinicians and researchers, limiting their value in workflows that still require direct inspection, exploratory analysis, and human interpretation. We propose a real-world-data transformation framework for privacy-preserving sharing of structured clinical records. Instead of converting data into opaque representations, our approach constructs transformed numeric views that preserve medical semantics and major statistical properties while, under a clearly specified threat model, provably breaking direct linkage between those views and protected patient-level attributes. Through collaboration between computer scientists and the AI agent SciencePal, acting as a constrained tool inventor under human guidance, we design three transformation operators that are non-reversible within this threat model, together with an additional mixing strategy for high-risk scenarios, supported by theoretical analysis and empirical evaluation under reconstruction, record linkage, membership inference, and attribute inference attacks.

Keywords: privacy-preserving, EHR data transformation, geometric operators, human-AI co-design, clinical records, data sharing, threat model, medical semantics

264. ❌ Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation

Authors: Xinxin Li, Xingyu Cui, Jin Qi, Juan Zhang, Da Li, Junping Yin Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

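Scoring rationale: The paper “Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation” focuses on discovering partial differential equations (PDEs) from sparse, noisy data using differentiable symbolic networks and the weak formulation, which places it in scientific computing and AI for Science. It does not involve large language models (LLMs) or innovations in deep learning principles (such as MoE, scaling laws, training methods, inference optimization, or agent systems), so it is unrelated to the vast majority of keywords. The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper applies AI (specifically neural networks and differentiable architecture search) to a scientific discovery problem (PDE discovery) and therefore falls under AI for Science. Since its core focus is not large models or innovations in deep learning principles, it receives a relevance of 5 (partly related).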

!!! tip deepseek-chat TL;DR

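The paper proposes Weak-PDE-Net, an end-to-end differentiable framework that combines a differentiable symbolic network with the weak formulation to robustly discover open-form partial differential equations from sparse and noisy data, and validates its effectiveness on several PDE benchmarks.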

Abstract

Discovering governing Partial Differential Equations (PDEs) from sparse and noisy data is a challenging issue in data-driven scientific computing. Conventional sparse regression methods often suffer from two major limitations: (i) the instability of numerical differentiation under sparse and noisy data, and (ii) the restricted flexibility of a pre-defined candidate library. We propose Weak-PDE-Net, an end-to-end differentiable framework that can robustly identify open-form PDEs. Weak-PDE-Net consists of two interconnected modules: a forward response learner and a weak-form PDE generator. The learner embeds learnable Gaussian kernels within a lightweight MLP, serving as a surrogate model that adaptively captures system dynamics from sparse observations. Meanwhile, the generator integrates a symbolic network with an integral module to construct weak-form PDEs, avoiding explicit numerical differentiation and improving robustness to noise. To relax the constraints of the pre-defined library, we leverage Differentiable Neural Architecture Search strategy during training to explore the functional space, which enables the efficient discovery of open-form PDEs. The capability of Weak-PDE-Net in multivariable systems discovery is further enhanced by incorporating Galilean Invariance constraints and symmetry equivariance hypotheses to ensure physical consistency. Experiments on several challenging PDE benchmarks demonstrate that Weak-PDE-Net accurately recovers governing equations, even under highly sparse and noisy observations.

Keywords: PDE discovery, weak formulation, differentiable symbolic networks, sparse noisy data, neural architecture search, open-form PDEs, scientific computing, AI for science
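
The abstract's claim that a weak form avoids explicit numerical differentiation can be illustrated with integration by parts: for a test function `phi` that vanishes at the ends of the interval, the integral of `u' * phi` equals minus the integral of `u * phi'`, so the derivative moves from the noisy data onto the smooth test function. The sketch below is a minimal stand-in under assumed choices (signal `sin(x)`, a polynomial bump as test function, Gaussian noise, trapezoid rule); it is not the paper's symbolic network or integral module.

```python
import math
import random
from typing import List


def trapezoid(values: List[float], h: float) -> float:
    """Composite trapezoid rule on a uniform grid with spacing h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


a, b, n = 0.0, 2.0, 2001
h = (b - a) / (n - 1)
xs = [a + i * h for i in range(n)]

rng = random.Random(0)
noise_std = 0.05
u_noisy = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]

# Test function phi(x) = ((x - a)(b - x))^2 and its exact derivative;
# phi vanishes at both endpoints, so the boundary term drops out.
phi = [((x - a) * (b - x)) ** 2 for x in xs]
dphi = [2.0 * (x - a) * (b - x) * (a + b - 2.0 * x) for x in xs]

# Weak form: integral of u' * phi  ==  -integral of u * phi'.
weak_estimate = -trapezoid([u * d for u, d in zip(u_noisy, dphi)], h)
# Reference value from the clean derivative cos(x).
reference = trapezoid([math.cos(x) * p for x, p in zip(xs, phi)], h)
weak_error = abs(weak_estimate - reference)

# Contrast: pointwise central differences applied directly to the noisy samples.
fd_max_error = max(
    abs((u_noisy[i + 1] - u_noisy[i - 1]) / (2.0 * h) - math.cos(xs[i]))
    for i in range(1, n - 1)
)
print(f"weak-form error: {weak_error:.4f}, worst finite-difference error: {fd_max_error:.1f}")
```

Integration averages the noise out, whereas differencing neighbouring samples divides it by the grid spacing, which is why the pointwise derivative estimate is far less reliable on the same data.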

265. ❌ VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

Authors: Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

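Scoring rationale: The paper explicitly combines LLMs with reinforcement learning, involves fine-tuning (that is, SFT), and builds agents that can carry out tasks, so these three keywords are rated highly relevant (10). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper's core content, so they are rated 0.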
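
The table for this entry lists three keywords at 10.0/10 relevance, yet the total stays at 0.0 because every listed weight is 0.0. A minimal sketch of that arithmetic follows. It assumes each row's score is weight times relevance and that a paper passes when its total reaches the pass mark; the report does not state either rule explicitly, so both are assumptions that merely agree with the rows shown. The pass mark of 26.6 is taken from the "0.0 / 26.6" line.

```python
from typing import List, Tuple

PASS_MARK = 26.6  # from the "0.0 / 26.6" line of each entry

# (keyword, weight, relevance out of 10) for the three non-zero relevance rows;
# the remaining 24 rows have relevance 0.0 and cannot change the total.
rows: List[Tuple[str, float, float]] = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    ("Post-training OR Supervised Fine-tuning OR SFT", 0.0, 10.0),
    ("LLM Agents OR Autonomous Agents OR Agentic Workflow", 0.0, 10.0),
]


def total_score(table: List[Tuple[str, float, float]]) -> float:
    """Assumed rule: row score = weight * relevance, total = sum of row scores."""
    return sum(weight * relevance for _, weight, relevance in table)


total = total_score(rows)
passed = total >= PASS_MARK  # assumed pass rule
print(f"total = {total}, passed = {passed}")
```

Under these assumptions, a zero weight cancels any relevance rating, so the high ratings explained in the rationale have no effect on the final score.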

!!! tip deepseek-chat TL;DR

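The paper proposes the VLGOR framework, which combines visual and language knowledge to generate imaginary interaction data, addressing LLMs' weaknesses in physical-environment perception and task generalization. Experiments show that its success rate on unseen tasks is more than 24% higher than that of baseline methods.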

Abstract

Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

Keywords: Large Language Models, Reinforcement Learning, Visual-Language Model, Offline RL, Generalizable Agents, Fine-tuning, Robotic Manipulation, Task Generalization

266. ❌ Conditionally Identifiable Latent Representation for Multivariate Time Series with Structural Dynamics

Authors: Minkey Chang, Jae-Young Kim Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

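Scoring rationale: This paper studies a latent factor model for multivariate time series (iVDFM), focusing on identifiability guarantees, linear diagonal dynamics, and probabilistic forecasting; it belongs to classical statistical machine learning and time series analysis. All scoring keywords relate directly to large models, deep learning principles, or applications of AI in science, whereas this paper involves no large models, deep learning, or AI for Science content, so the relevance of every keyword is 0.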

!!! tip deepseek-chat TL;DR

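The paper proposes the Identifiable Variational Dynamic Factor Model (iVDFM), which addresses the identifiability of latent factors learned from multivariate time series, and demonstrates improved factor recovery on synthetic data and competitive probabilistic forecasting on real-world benchmarks.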

Abstract

We propose the Identifiable Variational Dynamic Factor Model (iVDFM), which learns latent factors from multivariate time series with identifiability guarantees. By applying iVAE-style conditioning to the innovation process driving the dynamics rather than to the latent states, we show that factors are identifiable up to permutation and component-wise affine (or monotone invertible) transformations. Linear diagonal dynamics preserve this identifiability and admit scalable computation via companion-matrix and Krylov methods. We demonstrate improved factor recovery on synthetic data, stable intervention accuracy on synthetic SCMs, and competitive probabilistic forecasting on real-world benchmarks.

Keywords: identifiable latent representation, multivariate time series, structural dynamics, variational dynamic factor model, factor recovery, probabilistic forecasting, linear diagonal dynamics, intervention accuracy

267. ❌ Balancing Safety and Efficiency in Aircraft Health Diagnosis: A Task Decomposition Framework with Heterogeneous Long-Micro Scale Cascading and Knowledge Distillation-based Interpretability

Authors: Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Suili Yang Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22885v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

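Scoring rationale: The paper focuses on the specific domain of aircraft health diagnosis and proposes a task decomposition framework (DDF) that combines long-micro scale cascading with knowledge-distillation-based interpretability. It mainly involves computer vision techniques (convolution, attention), task decomposition, knowledge distillation, and explainable AI, but it does not explicitly involve large language models (LLMs), innovations in deep learning principles, or any specific technique named in the scoring keywords (such as MoE, SFT, or RAG). The only points of relevance are: 1) “Mechanistic Interpretability OR Explainable AI” (5), because the paper provides interpretability through knowledge distillation; and 2) “AI for Science OR Bioinformatics OR Cheminformatics” (5), because aircraft health diagnosis can be viewed as an application of AI in science and engineering. All other keywords are unrelated and rated 0.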

!!! tip deepseek-chat TL;DR

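The study addresses data uncertainty, task heterogeneity, and computational inefficiency in whole-aircraft diagnosis for general aviation. By proposing the Diagnosis Decomposition Framework (DDF) and the Long-Micro Scale Diagnostician (LMSD), it decouples anomaly detection from fault classification, improves the Multi-Class Weighted Penalty Metric (MCWPM) by about 4-8% on the NGAFID dataset, and substantially reduces training time.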

Abstract

Whole-aircraft diagnosis for general aviation faces threefold challenges: data uncertainty, task heterogeneity, and computational inefficiency. Existing end-to-end approaches uniformly model health discrimination and fault characterization, overlooking intrinsic receptive field conflicts between global context modeling and local feature extraction, while incurring prohibitive training costs under severe class imbalance. To address these, this study proposes the Diagnosis Decomposition Framework (DDF), explicitly decoupling diagnosis into Anomaly Detection (AD) and Fault Classification (FC) subtasks via the Long-Micro Scale Diagnostician (LMSD). Employing a “long-range global screening and micro-scale local precise diagnosis” strategy, LMSD utilizes Convolutional Tokenizer with Multi-Head Self-Attention (ConvTokMHSA) for global operational pattern discrimination and Multi-Micro Kernel Network (MMK Net) for local fault feature extraction. Decoupled training separates “large-sample lightweight” and “small-sample complex” optimization pathways, significantly reducing computational overhead. Concurrently, Keyness Extraction Layer (KEL) via knowledge distillation furnishes physically traceable explanations for two-stage decisions, materializing interpretability-by-design. Experiments on the NGAFID real-world aviation dataset demonstrate approximately 4-8% improvement in Multi-Class Weighted Penalty Metric (MCWPM) over baselines with substantially reduced training time, validating comprehensive advantages in task adaptability, interpretability, and efficiency. This provides a deployable methodology for general aviation health management.

Keywords: Aircraft Health Diagnosis, Task Decomposition Framework, Long-Micro Scale Cascading, Knowledge Distillation, Interpretability, Anomaly Detection, Fault Classification, Computational Efficiency

268. ❌ Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints

Authors: Shengping Xie, Zekun Wu, Quan Chen, Kaixu Tang Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22824v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

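Scoring rationale: The paper studies the implicit bias of optimization algorithms (Normalized Steepest Descent, NucGD) on multiclass separable data, which is foundational research in deep learning optimization theory. It focuses on theoretical analysis of the geometry of gradient descent, nuclear norm constraints, low-rank structure, and stochastic optimization dynamics, and does not touch the specific techniques or application areas the keywords represent: large language models (LLMs), model training techniques (pre-training, fine-tuning, alignment), inference optimization, AI agents, or AI for science. All keywords relate directly to LLM techniques, applications, or adjacent subfields, whereas this is a pure optimization theory paper, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the implicit bias of gradient descent under nuclear norm constraints on multiclass separable data, proposes the NucGD optimizer to enforce low-rank structure, and uses theoretical analysis and experiments to show how stochastic optimization dynamics affect convergence to the maximum margin solution.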

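Abstract (translated)

The implicit bias induced by gradient-based algorithms is crucial to the generalization of overparameterized models, yet its mechanisms can be subtle. Using the Normalized Steepest Descent framework, this study examines how optimization geometry affects solutions on multiclass separable data. We propose NucGD, a geometry-aware optimizer that enforces low-rank structure through nuclear norm constraints. Beyond the algorithm itself, we relate NucGD to emerging low-rank projection methods, offering a unified view. For scalable training, we derive an efficient SVD-free update rule based on asynchronous power iteration. We also dissect experimentally the effect of stochastic optimization dynamics, describing how different levels of gradient noise from mini-batch sampling and momentum modulate convergence to the expected maximum margin solution. Our code is available at: https://github.com/Tsokarsic/observing-the-implicit-bias-on-multiclass-seperable-data.

Abstract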

Implicit bias induced by gradient-based algorithms is essential to the generalization of overparameterized models, yet its mechanisms can be subtle. This work leverages the Normalized Steepest Descent (NSD) framework to investigate how optimization geometry shapes solutions on multiclass separable data. We introduce NucGD, a geometry-aware optimizer designed to enforce low-rank structures through nuclear norm constraints. Beyond the algorithm itself, we connect NucGD with emerging low-rank projection methods, providing a unified perspective. To enable scalable training, we derive an efficient SVD-free update rule via asynchronous power iteration. Furthermore, we empirically dissect the impact of stochastic optimization dynamics, characterizing how varying levels of gradient noise induced by mini-batch sampling and momentum modulate the convergence toward the expected maximum margin solutions. Our code is accessible at: https://github.com/Tsokarsic/observing-the-implicit-bias-on-multiclass-seperable-data.

Keywords: implicit bias, gradient descent, multiclass separable data, nuclear norm constraints, low-rank structures, stochastic optimization, maximum margin solutions, NucGD optimizer
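
The SVD-free update mentioned in the abstract can be illustrated with a minimal sketch. It assumes that normalized steepest descent over the nuclear-norm unit ball moves along the rank-one matrix $u_1 v_1^\top$ built from the gradient's top singular pair (the maximizer of the inner product with the gradient over that ball), and it estimates that pair with plain power iteration. The function names, the learning rate, and the synchronous power iteration are assumptions for illustration; the paper's asynchronous variant and exact update rule are not specified in the abstract.

```python
import math


def matvec(A, x):
    """Return A @ x for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T @ y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_singular_pair(G, iters=50):
    """Estimate the top left/right singular vectors of G by power iteration (no SVD)."""
    v = normalize([1.0] * len(G[0]))  # assumed start vector, must not lie in the null space
    for _ in range(iters):
        u = normalize(matvec(G, v))
        v = normalize(rmatvec(G, u))
    return u, v


def nuclear_nsd_step(W, G, lr):
    """One hypothetical step: W <- W - lr * u1 v1^T, a rank-one update."""
    u, v = top_singular_pair(G)
    return [[W[i][j] - lr * u[i] * v[j] for j in range(len(v))] for i in range(len(u))]


# Toy gradient whose top singular pair is (e1, e1) with singular value 3.
G = [[3.0, 0.0], [0.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
W_new = nuclear_nsd_step(W, G, lr=0.1)
```

Each step changes the weights by a rank-one matrix, so after $k$ steps the accumulated update has rank at most $k$; this is one way to read the abstract's link between nuclear norm geometry and low-rank structure.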

269. ❌ Universal and efficient graph neural networks with dynamic attention for machine learning interatomic potentials

Authors: Shuyu Bi, Zhede Zhao, Qiangchao Sun, Tao Hu, Xionggang Lu, Hongwei Cheng Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22810v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

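Scoring rationale: The paper develops MLANet, an efficient graph neural network framework for machine learning interatomic potentials (MLIPs), an application of AI for Science in computational chemistry and materials science. Its core is graph neural networks, dynamic attention, and computational efficiency, which is unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, alignment, agents, and so on). It relates only to “AI for Science OR Bioinformatics OR Cheminformatics”: its application setting (molecular dynamics simulation, materials science) falls within AI for Science, but the paper does not involve specific bioinformatics or cheminformatics methods, hence a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper proposes MLANet, an efficient graph neural network framework that uses a dual-path dynamic attention mechanism and a multi-perspective pooling strategy to address the accuracy and computational-efficiency challenges of machine learning interatomic potentials, enabling accurate, low-cost atomic-scale simulation.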

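Abstract (translated)

The interatomic potential is at the core of molecular dynamics simulation. Traditional empirical potentials are not accurate enough, while first-principles methods are computationally too expensive. Machine learning interatomic potentials (MLIPs) promise near-quantum-mechanical accuracy at linear computational cost, but existing models still face challenges in efficiency and stability. This paper presents Machine Learning Advances Neural Network (MLANet), an efficient and robust graph neural network framework. MLANet introduces a dual-path dynamic attention mechanism for geometry-aware message passing and adopts a multi-perspective pooling strategy to build comprehensive system representations. The design models atomic environments with high accuracy while achieving excellent computational efficiency, making high-fidelity simulation more feasible. Tested on a broad range of datasets covering diverse systems, including organic molecules (e.g., QM7, MD17), periodic inorganic materials (e.g., Li-containing crystals), two-dimensional materials (e.g., bilayer graphene, black phosphorus), surface catalytic reactions (e.g., formate decomposition), and charged systems, MLANet keeps competitive prediction accuracy at a computational cost markedly lower than mainstream equivariant models, and supports stable long-time molecular dynamics simulations. MLANet offers an efficient and practical tool for large-scale, high-accuracy atomic simulation.

Abstract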

The core of molecular dynamics simulation fundamentally lies in the interatomic potential. Traditional empirical potentials lack accuracy, while first-principles methods are computationally prohibitive. Machine learning interatomic potentials (MLIPs) promise near-quantum accuracy at linear cost, but existing models still face challenges in efficiency and stability. We present Machine Learning Advances Neural Network (MLANet), an efficient and robust graph neural network framework. MLANet introduces a dual-path dynamic attention mechanism for geometry-aware message passing and a multi-perspective pooling strategy to construct comprehensive system representations. This design enables highly accurate modeling of atomic environments while achieving exceptional computational efficiency, making high-fidelity simulations more accessible. Tested across a wide range of datasets spanning diverse systems, including organic molecules (e.g., QM7, MD17), periodic inorganic materials (e.g., Li-containing crystals), two-dimensional materials (e.g., bilayer graphene, black phosphorus), surface catalytic reactions (e.g., formate decomposition), and charged systems, MLANet maintains competitive prediction accuracy while its computational cost is markedly lower than mainstream equivariant models, and it enables stable long-time molecular dynamics simulations. MLANet provides an efficient and practical tool for large-scale, high-accuracy atomic simulations.

Keywords: graph neural networks, dynamic attention, machine learning interatomic potentials, molecular dynamics simulation, computational efficiency, atomic environments, message passing, high-accuracy simulation

270. ❌ Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes

Authors: Praneeth Vepakomma Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

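Scoring rationale: The paper studies a privacy-preserving protocol (PolyVeil) for secure multi-party computation, focusing on private Boolean summation; it encodes inputs with the Birkhoff polytope and permutation matrices and analyzes differential privacy guarantees. The topic belongs to cryptography, privacy-preserving computation, and distributed systems, and is entirely unrelated to all of the provided LLM and deep learning keywords (LLMs, MoE, training methods, inference optimization, AI applications, and so on); nothing in it concerns LLM techniques or AI applications in science.

!!! tip deepseek-chat TL;DR

The paper proposes the PolyVeil protocol for privacy-preserving multi-party Boolean summation. By encoding private bits as permutation matrices in the Birkhoff polytope, it achieves perfect simulation-based security against the server, analyzes differential privacy guarantees, and exposes a fundamental tension between computational hardness and privacy guarantees.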

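Abstract (translated)

We propose PolyVeil, a protocol for private Boolean summation across $k$ clients that encodes private bits as permutation matrices in the Birkhoff polytope. A two-layer architecture gives the server perfect simulation-based security (statistical distance zero), while a separate aggregator faces a #P-hard likelihood inference problem via the permanent and the mixed discriminant. The two variants (full and compressed) differ in the form of data the aggregator observes.
We establish a finite-sample $(\varepsilon,\delta)$-differential privacy analysis with explicit constants. In the full variant, the aggregator observes a doubly stochastic matrix per client; the log-Lipschitz constant grows as $n^4 K_t$, and a signal-to-noise analysis shows that the differential privacy guarantee is non-vacuous only when the private signal is undetectable. In the compressed variant, the aggregator observes only a single scalar, and the univariate density ratio yields a non-vacuous $\varepsilon$ at moderate signal-to-noise ratio, with the optimal number of decoys balancing central limit theorem accuracy against noise concentration.
This exposes a fundamental tension: #P-hardness requires the full matrix view (Birkhoff structure visible), while non-vacuous differential privacy requires the scalar view (low dimensionality). Whether both can hold simultaneously in one variant remains an open problem. The protocol needs no public key infrastructure, has $O(k)$ communication complexity, and outputs exact aggregates.

Abstract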

We introduce PolyVeil, a protocol for private Boolean summation across $k$ clients that encodes private bits as permutation matrices in the Birkhoff polytope. A two-layer architecture gives the server perfect simulation-based security (statistical distance zero) while a separate aggregator faces #P-hard likelihood inference via the permanent and mixed discriminant. Two variants (full and compressed) differ in what the aggregator observes. We develop a finite-sample $(\varepsilon,\delta)$-DP analysis with explicit constants. In the full variant, where the aggregator sees a doubly stochastic matrix per client, the log-Lipschitz constant grows as $n^4 K_t$ and a signal-to-noise analysis shows the DP guarantee is non-vacuous only when the private signal is undetectable. In the compressed variant, where the aggregator sees a single scalar, the univariate density ratio yields non-vacuous $\varepsilon$ at moderate SNR, with the optimal decoy count balancing CLT accuracy against noise concentration. This exposes a fundamental tension. #P-hardness requires the full matrix view (Birkhoff structure visible), while non-vacuous DP requires the scalar view (low dimensionality). Whether both hold simultaneously in one variant remains open. The protocol needs no PKI, has $O(k)$ communication, and outputs exact aggregates.

Keywords: private Boolean summation, Birkhoff polytope, permutation matrices, differential privacy, secure multi-party computation, PolyVeil protocol, simulation-based security, #P-hard inference
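
The Birkhoff structure the abstract refers to can be made concrete with a short sketch: a convex combination of permutation matrices is doubly stochastic, meaning every row and column sums to 1. This only illustrates the polytope itself. How PolyVeil maps private bits to permutations and how it chooses decoys is not described in the abstract, so the random permutations, the mixing weights, and all function names below are hypothetical.

```python
import random


def permutation_matrix(perm):
    """Build the n x n 0/1 matrix with a 1 at (i, perm[i])."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def convex_mix(mats, weights):
    """Weighted sum of equally sized square matrices; weights are assumed to sum to 1."""
    n = len(mats[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(n)]
        for i in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 4
perms = [rng.sample(range(n), n) for _ in range(3)]  # hypothetical "real plus decoy" permutations
raw = [rng.random() for _ in perms]
weights = [r / sum(raw) for r in raw]  # nonnegative weights normalized to sum to 1

D = convex_mix([permutation_matrix(p) for p in perms], weights)
row_sums = [sum(row) for row in D]
col_sums = [sum(D[i][j] for i in range(n)) for j in range(n)]
```

An observer who sees only `D` must reason about which permutations and weights produced it, which is the kind of inference the abstract describes as #P-hard for the aggregator in the full variant.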

271. ❌ Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models

Authors: Chenyang Zhang, Qingyue Zhao, Quanquan Gu, Yuan Cao Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22801v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

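Scoring rationale: The paper studies the theoretical learning ability of Transformers, specifically theoretical guarantees for a Transformer acting as a student that learns from teacher models (convolution layers, graph convolution layers, sparse token selection models, and group-sparse linear predictors). Its core is theoretical analysis proving that a simplified one-layer Transformer can recover the teacher's parameters and reach the optimal population loss, with good out-of-distribution generalization. The given keywords all target specific techniques or application areas of LLMs (applications, training techniques, inference optimization, alignment, agent systems, scientific applications), whereas this paper is pure theory about the properties of Transformers as general function approximators and involves no specific LLM technique, application, or engineering implementation. The paper therefore has no direct connection to any keyword, and all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proves theoretically that a simplified one-layer Transformer, acting as a student, can learn the parameters of a class of teacher models (including convolution layers, graph convolution layers, and classic statistical learning models), reach the optimal population loss, and generalize well to a broad class of out-of-distribution data under mild assumptions.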

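Abstract (translated)

Transformers have achieved great success across many application areas, but the theoretical foundations of that success remain largely unexplored. To explain the strong capabilities Transformers show across diverse scenarios and tasks, we study theoretically their use as student models that learn from a class of teacher models. Specifically, the teacher models covered by our analysis include convolution layers with average pooling, graph convolution layers, and several classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teachers, we prove that a one-layer Transformer with a simplified “position-only” attention mechanism can recover all parameter blocks of the teacher model and thus reach the optimal population loss. Building on how efficiently trained Transformers mimic the teacher models, we further prove that under mild assumptions they generalize well to a broad class of out-of-distribution data. The key to our analysis is identifying a basic bilinear structure shared by many learning tasks, which lets us establish unified learning guarantees when these tasks are treated as teachers for Transformers.

Abstract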

Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile scenarios and tasks, we theoretically investigate utilizing transformers as students to learn from a class of teacher models. Specifically, the teacher models covered in our analysis include convolution layers with average pooling, graph convolution layers, and various classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teacher models, we prove that one-layer transformers with simplified “position-only” attention can successfully recover all parameter blocks of the teacher models, thus achieving the optimal population loss. Building upon the efficient mimicry of trained transformers towards teacher models, we further demonstrate that they can generalize well to a broad class of out-of-distribution data under mild assumptions. The key in our analysis is to identify a fundamental bilinear structure shared by various learning tasks, which enables us to establish unified learning guarantees for these tasks when treating them as teachers for transformers.

Keywords: Transformers, theoretical analysis, teacher models, gradient descent, parameter recovery, population loss, out-of-distribution generalization, bilinear structure
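
To make “position-only” attention more tangible, here is a minimal sketch in which the attention scores depend only on token positions, never on token content. With a hypothetical score pattern that is high inside a sliding window and low outside it, the softmax weights become nearly uniform over the window, so the layer approximates average pooling, one of the teacher operations named in the abstract. The score pattern, the `scale` value, and the function names are assumptions for illustration and are not the paper's construction.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_only_attention(tokens, pos_scores):
    """tokens: list of n vectors; pos_scores: n x n scores that depend only on positions."""
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        weights = softmax(pos_scores[i])
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)])
    return out


def window_scores(n, width, scale=50.0):
    """Hypothetical pattern: position i scores positions i .. i+width-1 highly."""
    return [[scale if i <= j < i + width else -scale for j in range(n)] for i in range(n)]


tokens = [[1.0], [3.0], [5.0], [7.0]]
out = position_only_attention(tokens, window_scores(4, 2))
# Row 0 attends almost uniformly to positions 0 and 1, so it approximates their mean.
```

Because the scores ignore token content, the same attention pattern applies to any input sequence of the same length, which is what lets a fixed pattern like this stand in for a pooling window.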

272. ❌ Exposure-Normalized Bed and Chair Fall Rates via Continuous AI Monitoring

Authors: Paolo Gabriel, Peter Rehani, Zack Drumm, Tyler Troy, Tiffany Wyatt, Narinder Singh Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22785v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

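Scoring rationale: The paper is an observational study in healthcare that uses continuous AI monitoring to estimate fall rates; its core is applying AI to a specific clinical problem (fall risk assessment). It is unrelated to the vast majority of keywords (LLM principles, training methods, inference optimization, alignment, agent systems), which target research on LLMs and deep learning techniques themselves. The only possibly relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, since the study applies AI in science (specifically biomedical and health science), but the paper treats AI as a tool rather than examining the technique in depth, so the relevance is weak and it receives 5 (some connection).

!!! tip deepseek-chat TL;DR

The study uses continuous AI monitoring to estimate exposure-time-based fall rates in hospital settings. It estimates a higher fall rate for chairs (17.8 per 1,000 exposure-hours) than for beds (4.3 per 1,000 exposure-hours), reports that 6 of 7 direct chair falls in a separate observation cohort involved footrest-positioning failures, and recommends testing safer chair setups rather than reducing chair use.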

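Abstract (translated)

This retrospective cohort study used continuous AI monitoring to assess fall incidence by exposure time rather than occupied bed-days. From August 2024 to December 2025, 3,980 eligible monitoring units produced 292,914 hourly records; probability weighting yields 17.8 falls per 1,000 hours of chair exposure and 4.3 per 1,000 hours of bed exposure. Within the study window, 43 adjudicated falls matched the monitoring pipeline, of which 40 were linked to exposure hours eligible for the primary Poisson model, giving an adjusted chair-to-bed rate ratio of 2.35 (95% confidence interval 0.87 to 6.33; p=0.0907). In a separate, broader observation cohort (n=32 events after deduplication), 6 of the 7 falls that occurred directly from a chair involved footrest-positioning failures. Because this was an observational study within a single health system, the findings remain hypothesis-generating; they support testing safer chair setups rather than reducing chair use.

Abstract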

This retrospective cohort study used continuous AI monitoring to estimate fall rates by exposure time rather than occupied bed-days. From August 2024 to December 2025, 3,980 eligible monitoring units contributed 292,914 hourly rows, yielding probability-weighted rates of 17.8 falls per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours. Within the study window, 43 adjudicated falls matched the monitoring pipeline, and 40 linked to eligible exposure hours for the primary Poisson model, producing an adjusted chair-versus-bed rate ratio of 2.35 (95% confidence interval 0.87 to 6.33; p=0.0907). In a separate broader observation cohort (n=32 deduplicated events), 6 of 7 direct chair falls involved footrest-positioning failures. Because this was an observational study in a single health system, these findings remain hypothesis-generating and support testing safer chair setups rather than using chairs less.

Keywords: continuous AI monitoring, fall rates, exposure time, chair falls, bed falls, retrospective cohort study, healthcare, patient safety
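
The exposure-normalized rates in the abstract follow the usual "events per 1,000 exposure-hours" arithmetic, sketched below. The two reported rates (17.8 and 4.3) come from the abstract; the event and hour counts passed to `rate_per_1000_hours` are hypothetical, since the abstract does not break exposure hours down by chair and bed.

```python
def rate_per_1000_hours(events, exposure_hours):
    """Events per 1,000 hours of exposure."""
    return 1000.0 * events / exposure_hours


# Hypothetical counts, only to show the arithmetic: 2 falls over 500 exposure-hours.
example_rate = rate_per_1000_hours(events=2, exposure_hours=500)

# Rates reported in the abstract (probability-weighted).
chair_rate = 17.8  # falls per 1,000 chair exposure-hours
bed_rate = 4.3     # falls per 1,000 bed exposure-hours

# Crude ratio of the two reported rates; this is not the adjusted Poisson estimate.
crude_ratio = chair_rate / bed_rate
```

The crude ratio of the two reported rates is about 4.1, whereas the abstract reports an adjusted rate ratio of 2.35 (95% confidence interval 0.87 to 6.33) from the primary Poisson model. These are different quantities, and the wide interval is why the authors call the findings hypothesis-generating.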

273. ❌ Explainable Threat Attribution for IoT Networks Using Conditional SHAP and Flow Behavior Modelling

Authors: Samuel Ozechi, Jennifer Okonkwoabutu Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

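Scoring rationale: The paper focuses on threat attribution and explainability in IoT network security, using a gradient boosting model and SHAP for feature explanation. It is unrelated to most keywords (LLM principles, training methods, inference optimization, alignment, agents). The only relevant keyword is “Mechanistic Interpretability OR Explainable AI”: the paper explicitly uses SHAP for model explanation, which falls under explainable AI, but this is not its core innovation, hence a relevance of 8.

!!! tip deepseek-chat TL;DR

The study proposes an explainable threat attribution method based on a gradient boosting model and SHAP for classifying IoT network attacks; it identifies the behavioral signatures of different attacks and improves the explainability of intrusion detection systems.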

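Abstract (translated)

As the Internet of Things (IoT) keeps expanding across critical infrastructure, smart environments, and consumer devices, protecting it against cyber threats is increasingly important. Traditional intrusion detection models usually treat IoT threats as a binary classification problem or rely on opaque models, which limits trust. Using the CICIoT2023 dataset, this study examines multiclass threat attribution in IoT environments, grouping more than 30 attack variants into 8 semantically meaningful classes. We combine a gradient boosting model with SHAP (SHapley Additive exPlanations) to provide global and class-specific explanations, giving detailed insight into the features that drive each attack classification. The results show that the model separates the behavioral signatures of different attacks through flow timing, packet size uniformity, TCP flag dynamics, and statistical variance. Further analysis exposes per-class feature attributions and decision trajectories, which validates the observed patterns. Our findings support the development of more accurate and explainable intrusion detection systems, bridging the gap between high-performance machine learning and the need for trust and accountability in AI-driven cybersecurity for IoT environments.

Abstract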

As the Internet of Things (IoT) continues to expand across critical infrastructure, smart environments, and consumer devices, securing them against cyber threats has become increasingly vital. Traditional intrusion detection models often treat IoT threats as binary classification problems or rely on opaque models, thereby limiting trust. This work studies multiclass threat attribution in IoT environments using the CICIoT2023 dataset, grouping over 30 attack variants into 8 semantically meaningful classes. We utilize a combination of a gradient boosting model and SHAP (SHapley Additive exPlanations) to deliver both global and class-specific explanations, enabling detailed insight into the features driving each attack classification. The results show that the model distinguishes distinct behavioral signatures of the attacks using flow timing, packet size uniformity, TCP flag dynamics, and statistical variance. Additional analysis that exposes both feature attribution and the decision trajectory per class further validates these observed patterns. Our findings contribute to the development of more accurate and explainable intrusion detection systems, bridging the gap between high-performance machine learning and the need for trust and accountability in AI-driven cybersecurity for IoT environments.

Keywords: IoT security, threat attribution, explainable AI, SHAP, gradient boosting, intrusion detection, CICIoT2023, cybersecurity
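
The attribution idea behind SHAP can be sketched with exact Shapley values on a toy scoring function. This is a brute-force illustration under stated assumptions, not the `shap` library and not the paper's pipeline: the model `toy_score`, the three feature names, and the all-zero baseline are hypothetical, and "absent" features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets (feasible only for few features)."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value, the others take the baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_score(z):
    """Hypothetical attack score: one additive term plus one interaction term."""
    flow_timing, size_uniformity, syn_flag_rate = z
    return 2.0 * flow_timing + size_uniformity * syn_flag_rate


x = [0.5, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, x, baseline)
```

The attributions add up to the difference between the score at `x` and the score at the baseline, and the two interacting features split their joint contribution equally. Per-class explanations such as those described in the abstract repeat this kind of attribution for each class output.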

274. ❌ Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models

Authors: Amir Azarmehr, Soheil Behnezhad, Alma Ghafari Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

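Scoring rationale: The paper focuses on optimizing test-time computation algorithms for LLMs; its core contribution is the Caterpillar of Thoughts (CaT) algorithm, which relates directly to Chain of Thought (CoT) and Tree of Thoughts (ToT), so “Large Language Models” and “Chain of Thought” receive 10. “System 2 Thinking” receives 8 because the paper involves in-depth reasoning and backtracking; “Self-Correction” receives 5 because the algorithm allows backtracking and revising partial solutions. The remaining keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how large language models can make optimal use of a fixed computation budget at test time. It proposes the Caterpillar of Thoughts (CaT) algorithm and shows empirically that, compared with Tree of Thoughts (ToT), CaT achieves a better success rate while generating fewer tokens.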

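Abstract (translated)

Large language models (LLMs) can often produce markedly better outputs when given additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Although empirical results for these techniques keep growing, theoretical understanding of how inference-time computation should be organized, or how a fixed computation budget is best used, remains limited.
We model test-time computation as an algorithm interacting with a Markov chain: at any moment, the algorithm may resume generation from any previously observed state. In other words, unlike a standard Markov chain whose states are generated passively, we let the algorithm backtrack at any time to any observed state of the chain. Many existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of-$k$ (Brown et al., 2024), can be seen as specific algorithms in this model.
We prove that although backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient for optimality. Specifically, we prove that the optimal algorithm always generates a caterpillar tree: removing the leaves of the state tree generated by the optimal algorithm leaves a path. Motivated by this characterization, we propose Caterpillar of Thoughts (CaT), a new test-time computation algorithm that aims to reduce the number of token/state generations. Empirical evaluation shows that, compared with ToT, CaT improves the task success rate while reducing the number of generated tokens.

Abstract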

Large language models (LLMs) can often produce substantially better outputs when allowed to use additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Despite the growing empirical success of such techniques, there is limited theoretical understanding of how inference time computation should be structured, or what constitutes an optimal use of a fixed computation budget. We model test-time computation as an algorithm interacting with a Markov chain: at any point, the algorithm may resume generation from any previously observed state. That is, unlike standard Markov chains where the states are drawn passively, we allow the algorithm to backtrack to any previously observed state of the Markov chain at any time. Many of the existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of-$k$ (Brown et al., 2024) could be seen as specific algorithms in this model. We prove that while backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient. Namely, we show that the optimal algorithm always generates a caterpillar tree. That is, if we remove the leaves of the state tree generated by the optimal algorithm, we obtain a path. Motivated by our characterization of the optimal algorithm, we present Caterpillar of Thoughts (CaT), a new test-time computation algorithm, reducing the number of token/state generations. Our empirical evaluation shows that CaT, compared to ToT, achieves a better success rate while also reducing the number of token generations.

Keywords: Large Language Models, Test-time Computation, Chain of Thought, Tree of Thoughts, Caterpillar of Thoughts, Backtracking, Inference Algorithm, Markov Chain
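
The caterpillar property stated in the abstract (removing the leaves of the tree leaves a path) is easy to check mechanically. The sketch below tests it for a tree given as an edge list; it assumes the input really is a tree and does not verify that. The function name and the two example trees are hypothetical and are not taken from the paper.

```python
def is_caterpillar(edges):
    """Return True if the tree given by `edges` becomes a path once its leaves are removed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    leaves = {node for node, nbrs in adj.items() if len(nbrs) == 1}
    # Neighbors that remain for each non-leaf node after all leaves are deleted.
    spine = {node: nbrs - leaves for node, nbrs in adj.items() if node not in leaves}

    # A tree minus its leaves is still a tree, and a tree is a path exactly when
    # no node has more than two neighbors (an empty spine counts as a trivial path).
    return all(len(nbrs) <= 2 for nbrs in spine.values())


# Spine 1-2-3 with leaves hanging off it: a caterpillar.
caterpillar_edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 6)]

# Three branches of length two from a center: after removing the leaves,
# the center still has three neighbors, so this is not a caterpillar.
spider_edges = [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5), (5, 6)]

caterpillar_ok = is_caterpillar(caterpillar_edges)
spider_ok = is_caterpillar(spider_edges)
```

In the abstract's model the tree nodes are generated states, so the second example corresponds to a search that develops three separate branches in depth, which the caterpillar characterization rules out for the optimal algorithm.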

275. ❌ From Arithmetic to Logic: The Resilience of Logic and Lookup-Based Neural Networks Under Parameter Bit-Flips

Authors: Alan T. L. Bacellar, Sathvik Chemudupati, Shashank Nag, Allison Seigler, Priscila M. V. Lima, Felipe M. G. França, Lizy K. John Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

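Scoring rationale: The paper studies the hardware fault tolerance of deep neural networks in safety-critical edge environments, focusing on model robustness under parameter bit-flip errors. Through theoretical analysis and experiments, it shows that design trends such as low precision, high sparsity, bounded activations, and shallow depth improve fault tolerance, and it specifically examines logic- and lookup-table-based neural network architectures. The paper is unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, alignment, agents, and so on). It is only partly related to 'Quantization OR Model Compression OR Low-bit Weights' (5 points), because it discusses how reduced numerical precision (low-bit weights) affects fault tolerance; this is one aspect of the analysis rather than the paper's core contribution.

!!! tip deepseek-chat TL;DR

The paper studies the robustness of deep neural networks to hardware bit-flip errors. Theoretical derivation and experiments show that low precision, high sparsity, bounded activations, and shallow depth, and in particular a shift to discrete Boolean architectures based on logic and lookup tables, offer a better accuracy-resilience trade-off for hardware fault tolerance.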

Abstract

The deployment of deep neural networks (DNNs) in safety-critical edge environments necessitates robustness against hardware-induced bit-flip errors. While empirical studies indicate that reducing numerical precision can improve fault tolerance, the theoretical basis of this phenomenon remains underexplored. In this work, we study resilience as a structural property of neural architectures rather than solely as a property of a dataset-specific trained solution. By deriving the expected squared error (MSE) under independent parameter bit flips across multiple numerical formats and layer primitives, we show that lower precision, higher sparsity, bounded activations, and shallow depth are consistently favored under this corruption model. We then argue that logic and lookup-based neural networks realize the joint limit of these design trends. Through ablation studies on the MLPerf Tiny benchmark suite, we show that the observed empirical trends are consistent with the theoretical predictions, and that LUT-based models remain highly stable in corruption regimes where standard floating-point models fail sharply. Furthermore, we identify a novel even-layer recovery effect unique to logic-based architectures and analyze the structural conditions under which it emerges. Overall, our results suggest that shifting from continuous arithmetic weights to discrete Boolean lookups can provide a favorable accuracy-resilience trade-off for hardware fault tolerance.

Keywords: neural network robustness, hardware fault tolerance, bit-flip errors, low precision, logic-based neural networks, lookup tables, MLPerf Tiny benchmark, accuracy-resilience trade-off
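
The corruption model described in the abstract, independent bit flips in stored parameters, can be illustrated with a minimal Python sketch. This is not the authors' derivation: the helper names `flip_float32_bit` and `flip_int8_bit` and the example weights are assumptions chosen for illustration. The sketch compares the worst single-bit error for one float32 weight with the worst single-bit error for one int8 weight.

```python
import struct

def flip_float32_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 31 = sign) of an IEEE-754 float32."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

def flip_int8_bit(value: int, bit: int) -> int:
    """Flip one bit (0 = LSB, 7 = sign) of a two's-complement int8."""
    raw = (value & 0xFF) ^ (1 << bit)
    return raw - 256 if raw >= 128 else raw

# Hypothetical example weights: 0.5 stored as float32, 64 stored as int8.
w_float, w_int = 0.5, 64
worst_float = max(abs(flip_float32_bit(w_float, b) - w_float) for b in range(32))
worst_int8 = max(abs(flip_int8_bit(w_int, b) - w_int) for b in range(8))
```

For these example values, flipping the top exponent bit of the float32 weight changes it by many orders of magnitude, while no single flip can move the int8 weight by more than 128 integer steps.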

276. ❌ REALITrees: Rashomon Ensemble Active Learning for Interpretable Trees

Authors: Simon D. Nguyen, Hayden McTavish, Kentaro Hoffman, Cynthia Rudin, Tyler H. McCormick Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

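Scoring rationale: The paper focuses on active learning and interpretable machine learning (specifically sparse decision trees) and is unrelated to the vast majority of keywords (large models, deep learning, alignment, reasoning, agents, and so on). The only weak connection is to 'Mechanistic Interpretability OR Explainable AI', because the paper involves interpretable decision-tree models, though this is not its core focus; hence 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes an active learning method based on Rashomon-set enumeration (REAL) for sparse decision trees. It builds a committee from all near-optimal models to improve sample selection and converges faster than randomized ensembles in moderately noisy environments.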

Abstract

Active learning reduces labeling costs by selecting samples that maximize information gain. A dominant framework, Query-by-Committee (QBC), typically relies on perturbation-based diversity by inducing model disagreement through random feature subsetting or data blinding. While this approximates one notion of epistemic uncertainty, it sacrifices direct characterization of the plausible hypothesis space. We propose the complementary approach: Rashomon Ensembled Active Learning (REAL) which constructs a committee by exhaustively enumerating the Rashomon Set of all near-optimal models. To address functional redundancy within this set, we adopt a PAC-Bayesian framework using a Gibbs posterior to weight committee members by their empirical risk. Leveraging recent algorithmic advances, we exactly enumerate this set for the class of sparse decision trees. Across synthetic and established active learning baselines, REAL outperforms randomized ensembles, particularly in moderately noisy environments where it strategically leverages expanded model multiplicity to achieve faster convergence.

Keywords: Active Learning, Rashomon Set, Sparse Decision Trees, Query-by-Committee, Ensemble Methods, Interpretable Models, PAC-Bayesian, Model Disagreement
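
The committee weighting described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the committee is a hand-written toy set of threshold rules rather than an enumerated Rashomon set, the temperature `lam` is a hypothetical parameter, and weighted vote entropy is used as the disagreement measure because it is a common Query-by-Committee criterion, not because the paper is known to use it.

```python
import math

def gibbs_weights(risks, lam=10.0):
    """Weight each committee member proportionally to exp(-lam * empirical risk)."""
    unnorm = [math.exp(-lam * r) for r in risks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def weighted_vote_entropy(votes, weights):
    """Entropy of the weighted label distribution; higher means more disagreement."""
    mass = {}
    for label, w in zip(votes, weights):
        mass[label] = mass.get(label, 0.0) + w
    return -sum(p * math.log(p) for p in mass.values() if p > 0)

def select_query(pool, committee, risks, lam=10.0):
    """Return the unlabeled point on which the weighted committee disagrees most."""
    weights = gibbs_weights(risks, lam)
    return max(
        pool,
        key=lambda x: weighted_vote_entropy([m(x) for m in committee], weights),
    )

# Toy committee of three near-optimal threshold rules on a 1-D feature (made up).
committee = [lambda x: int(x > 0.4), lambda x: int(x > 0.5), lambda x: int(x > 0.6)]
risks = [0.10, 0.08, 0.10]
query = select_query([0.1, 0.45, 0.9], committee, risks)
```

In this toy pool the members only disagree on the point 0.45, so that point is selected for labeling; lower-risk members contribute more to the vote.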

277. ❌ Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

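Scoring rationale: The paper addresses object over-fragmentation in video object-centric learning and proposes a reconstruction-guided slot curriculum (SlotCurri) comprising progressive slot allocation, a structure-aware loss, and cyclic inference. Although it is a deep learning application in computer vision, its content is unrelated to all scoring keywords (which revolve around large language model techniques, training methods, inference optimization, alignment, agent systems, and so on); it involves no large-model techniques, language models, AI-for-science applications, or related methods.

!!! tip deepseek-chat TL;DR

To address object over-fragmentation in video object-centric learning, the paper proposes a reconstruction-guided slot curriculum (SlotCurri). Through progressive slot allocation, a structure-aware loss, and cyclic inference, it improves object segmentation accuracy and temporal consistency, with FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C.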

Abstract

Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.

Keywords: Video Object-Centric Learning, object over-fragmentation, slot-attention models, reconstruction-guided slot curriculum, SlotCurri, temporally consistent object representations, FG-ARI, progressive slot allocation

278. ❌ Algorithmic warm starts for Hamiltonian Monte Carlo

Authors: Matthew S. Zhang, Jason M. Altschuler, Sinho Chewi Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

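Scoring rationale: The paper presents a theoretical improvement to the Hamiltonian Monte Carlo (HMC) sampling algorithm from classical statistical computing, within computational mathematics and statistical physics. It does not involve large models, deep learning, AI for Science, or any of the listed keyword techniques. All keywords concern large-model principles, training methods, inference optimization, alignment, and applications, whereas this paper is a purely theoretical algorithmic analysis, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper resolves the computational bottleneck of finding a warm start for Hamiltonian Monte Carlo when sampling from high-dimensional strongly log-concave distributions. It proves that non-Metropolized HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, improving the overall sampling complexity from $\tilde{O}(d^{1/2})$ to $\tilde{O}(d^{1/4})$.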

Abstract

Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension $d$. On one hand, a variety of results show that Metropolized HMC converges in $O(d^{1/4})$ iterations from a warm start close to stationarity. On the other hand, Metropolized HMC is significantly slower without a warm start, e.g., requiring $\Omega(d^{1/2})$ iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that *non-Metropolized* HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, after which we can exploit the warm start using Metropolized HMC. Our final complexity of $\tilde{O}(d^{1/4})$ is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of $\tilde{O}(d^{1/2})$. This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations.

Keywords: Hamiltonian Monte Carlo, sampling algorithms, warm starts, high-dimensional statistics, log-concave distributions, computational complexity, Markov chain Monte Carlo, isoperimetry
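
The two-phase recipe in the abstract (an unadjusted phase to reach a warm start, followed by Metropolized HMC) can be sketched on a toy target. The code below is a minimal illustration, not the paper's algorithm or its parameter choices: the 1-D standard Gaussian target, the step size, the number of leapfrog steps, and the iteration counts are all assumptions picked for readability.

```python
import math
import random

def leapfrog(q, p, grad_u, step, n_steps):
    """Integrate Hamiltonian dynamics for potential U with the leapfrog scheme."""
    p -= 0.5 * step * grad_u(q)              # initial half step for momentum
    for i in range(n_steps):
        q += step * p                        # full step for position
        if i < n_steps - 1:
            p -= step * grad_u(q)            # full momentum step between positions
    p -= 0.5 * step * grad_u(q)              # final half step for momentum
    return q, p

def hmc_step(q, u, grad_u, step, n_steps, rng, metropolize):
    """One HMC iteration; with metropolize=False the proposal is always accepted."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, grad_u, step, n_steps)
    if not metropolize:
        return q_new
    h_old = u(q) + 0.5 * p0 * p0
    h_new = u(q_new) + 0.5 * p_new * p_new
    accept = rng.random() < math.exp(min(0.0, h_old - h_new))
    return q_new if accept else q

def sample(q0, u, grad_u, warm_iters, main_iters, step=0.2, n_steps=5, seed=0):
    rng = random.Random(seed)
    q = q0
    for _ in range(warm_iters):              # warm-start phase: unadjusted HMC
        q = hmc_step(q, u, grad_u, step, n_steps, rng, metropolize=False)
    draws = []
    for _ in range(main_iters):              # main phase: Metropolized HMC
        q = hmc_step(q, u, grad_u, step, n_steps, rng, metropolize=True)
        draws.append(q)
    return draws

# Standard Gaussian target, U(q) = q^2 / 2, started far from the mode.
draws = sample(10.0, lambda q: 0.5 * q * q, lambda q: q, warm_iters=20, main_iters=2000)
```

The only difference between the two phases is the accept/reject filter; the leapfrog proposal is shared.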

279. ❌ Behavioral Heterogeneity as Quantum-Inspired Representation

Authors: Mohammad Elayan, Wissam Kontar Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22729v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

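Scoring rationale: The paper studies a quantum-inspired representation of heterogeneity in driving behavior, using techniques such as density matrices and Random Fourier Features, and belongs to traffic behavior modeling. It is unrelated to every keyword, all of which concern large models, deep learning principles, or AI applications in science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a quantum-inspired representation for modeling heterogeneity in driving behavior. Each driver is represented as an evolving density matrix with structured mathematical properties, and the approach is demonstrated on TGSIM driving data, showing how driving profiles are extracted and analyzed.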

Abstract

Driver heterogeneity is often reduced to labels or discrete regimes, compressing what is inherently dynamic into static categories. We introduce quantum-inspired representation that models each driver as an evolving latent state, presented as a density matrix with structured mathematical properties. Behavioral observations are embedded via non-linear Random Fourier Features, while state evolution blends temporal persistence of behavior with context-dependent profile activation. We evaluate our approach on empirical driving data, Third Generation Simulation Data (TGSIM), showing how driving profiles are extracted and analyzed.

Keywords: behavioral heterogeneity, quantum-inspired representation, density matrix, random Fourier features, driver modeling, latent state evolution, driving profiles, TGSIM data
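
One way to picture the representation in the abstract is the sketch below. It is an illustration only: the abstract does not give the update rule, so the convex blend with a persistence factor `alpha`, the cosine form of the Random Fourier Features, the feature dimension, and the observation values are all assumptions, and every function name is hypothetical.

```python
import math
import random

def rff_embed(x, freqs, phases):
    """Random Fourier Features z_k = sqrt(2/D) * cos(w_k . x + b_k), unit-normalized."""
    d = len(freqs)
    z = [
        math.sqrt(2.0 / d) * math.cos(sum(w * xi for w, xi in zip(wk, x)) + bk)
        for wk, bk in zip(freqs, phases)
    ]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def outer(z):
    """Rank-one density matrix |z><z| for a unit vector z."""
    return [[a * b for b in z] for a in z]

def blend(rho, z, alpha):
    """Convex mix of the previous state (persistence alpha) with the new observation."""
    new = outer(z)
    return [
        [alpha * r + (1.0 - alpha) * n for r, n in zip(rho_row, new_row)]
        for rho_row, new_row in zip(rho, new)
    ]

rng = random.Random(0)
D, dim = 6, 2
freqs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(D)]
phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

observations = [[0.3, 1.2], [0.4, 1.0], [2.0, -0.5]]  # made-up behavioral features
rho = outer(rff_embed(observations[0], freqs, phases))
for obs in observations[1:]:
    rho = blend(rho, rff_embed(obs, freqs, phases), alpha=0.8)

trace = sum(rho[i][i] for i in range(D))
```

Because each update is a convex combination of unit-trace, symmetric, positive semidefinite matrices, the state keeps those density-matrix properties at every step.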

280. ❌ Multitask-Informed Prior for In-Context Learning on Tabular Data: Application to Steel Property Prediction

Authors: Dimitrios Sinodinos, Bahareh Nikpour, Jack Yi Wei, Sushant Sinha, Xiaoping Ma, Kashif Rehman, Stephen Yue, Narges Armanfard Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

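Scoring rationale: The paper's core is improving TabPFN (a Transformer-based foundation model) for in-context learning on tabular data, using multitask-learning fine-tuning strategies to improve steel property prediction. Highly relevant keywords: 'In-context Learning' (core method, 10 points), 'Foundation Models' (TabPFN is a foundation model, 8 points), 'Supervised Fine-tuning' (the paper proposes two fine-tuning strategies, 8 points), and 'AI for Science' (applied to a scientific problem in the steel industry, 8 points). Moderately relevant: 'Pre-training' (touches on foundation-model pre-training, 5 points) and 'Parameter-efficient Fine-tuning' (task-specific adapters resemble the PEFT idea, 5 points). The remaining keywords have no direct connection to the paper's tabular data, industrial application, or specific fine-tuning methods and receive 0 points.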
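
The arithmetic behind the 0.0 / 26.6 result can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the 15-point cap per keyword come from the report configuration; the per-keyword rule in `keyword_score` is an assumption, since the report does not state how weight and relevance are combined. The rows are this paper's non-zero relevance values paired with the weight of 0.0 recorded in the table.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report configuration: base + slope * total weight."""
    return base + slope * total_weight

def keyword_score(weight, relevance, max_score=15.0, max_relevance=10.0):
    """ASSUMED rule: weight * relevance, rescaled so 10/10 at weight 1.0 hits max_score."""
    return min(max_score, weight * relevance * max_score / max_relevance)

# (weight, relevance) pairs for the six keywords with non-zero relevance.
rows = [(0.0, 8.0), (0.0, 5.0), (0.0, 8.0), (0.0, 5.0), (0.0, 10.0), (0.0, 8.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)
passed = total >= threshold
```

Under this assumed rule, a recorded weight of 0.0 zeroes every keyword score regardless of relevance, which matches the total of 0.0 shown for this paper.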

!!! tip deepseek-chat TL;DR

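The study proposes a multitask learning framework that fine-tunes the TabPFN foundation model through target averaging and task-specific adapters, improving both the accuracy and the computational efficiency of mechanical property prediction for steel hot rolling.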

Abstract

Accurate prediction of mechanical properties of steel during hot rolling processes, such as Thin Slab Direct Rolling (TSDR), remains challenging due to complex interactions among chemical compositions, processing parameters, and resultant microstructures. Traditional empirical and experimental methodologies, while effective, are often resource-intensive and lack adaptability to varied production conditions. Moreover, most existing approaches do not explicitly leverage the strong correlations among key mechanical properties, missing an opportunity to improve predictive accuracy through multitask learning. To address this, we present a multitask learning framework that injects multitask awareness into the prior of TabPFN–a transformer-based foundation model for in-context learning on tabular data–through novel fine-tuning strategies. Originally designed for single-target regression or classification, we augment TabPFN’s prior with two complementary approaches: (i) target averaging, which provides a unified scalar signal compatible with TabPFN’s single-target architecture, and (ii) task-specific adapters, which introduce task-specific supervision during fine-tuning. These strategies jointly guide the model toward a multitask-informed prior that captures cross-property relationships among key mechanical metrics. Extensive experiments on an industrial TSDR dataset demonstrate that our multitask adaptations outperform classical machine learning methods and recent state-of-the-art tabular learning models across multiple evaluation metrics. Notably, our approach enhances both predictive accuracy and computational efficiency compared to task-specific fine-tuning, demonstrating that multitask-aware prior adaptation enables foundation models for tabular data to deliver scalable, rapid, and reliable deployment for automated industrial quality control and process optimization in TSDR.

Keywords: TabPFN, in-context learning, tabular data, multitask learning, fine-tuning, steel property prediction, foundation model, industrial application
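
The target-averaging idea in the abstract, collapsing several correlated properties into one scalar signal for a single-target model, might look roughly like this. The sketch makes assumptions the abstract does not confirm: each target is z-scored before averaging so that different units are comparable, and the property names and values are made up. It does not call TabPFN itself.

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score one target column so properties with different units are comparable."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def averaged_target(targets):
    """Collapse several target columns into one scalar per sample by averaging z-scores."""
    z_columns = [standardize(col) for col in targets.values()]
    return [mean(row) for row in zip(*z_columns)]

# Made-up values for three mechanical properties of four samples.
targets = {
    "yield_strength_mpa": [350.0, 420.0, 390.0, 460.0],
    "tensile_strength_mpa": [480.0, 540.0, 510.0, 590.0],
    "elongation_pct": [28.0, 24.0, 26.0, 21.0],
}
y_avg = averaged_target(targets)
```

The resulting `y_avg` has one value per sample and could serve as the single regression target during fine-tuning.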

281. ❌ Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication

Authors: Chen Shang, Dinh Thai Hoang, Diep N. Nguyen, Jiadong Yu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22727v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

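Scoring rationale: The paper studies a brain-computer interface (BCI)-driven immersive communication framework that uses personalized federated learning (PFL) and spiking neural networks (SNNs) to process brain-signal data, reducing energy consumption and protecting privacy. It is unrelated to most keywords (such as LLM, MoE, RLHF, and RAG), which mainly concern large language models and related techniques (training, inference, alignment, applications, and so on); the paper involves no language models or natural language processing. The only slightly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI (federated learning and SNNs in particular) to brain-signal analysis in the biomedical domain (BCI), which counts as an application of AI in science or bioinformatics. This is not the core focus (which is the communication framework and energy optimization), hence 5 points (some relevance). All other keywords receive 0 points (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a personalized federated learning framework built on brain-computer interfaces and spiking neural networks for brain-signal analysis in immersive communication. It maintains high identification accuracy while reducing inference energy by 6.46×.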

Abstract

This work proposes a novel immersive communication framework that leverages brain-computer interface (BCI) to acquire brain signals for inferring user-centric states (e.g., intention and perception-related discomfort), thereby enabling more personalized and robust immersive adaptation under strong individual variability. Specifically, we develop a personalized federated learning (PFL) model to analyze and process the collected brain signals, which not only accommodates neurodiverse brain-signal data but also prevents the leakage of sensitive brain-signal information. To address the energy bottleneck of continual on-device learning and inference on energy-limited immersive terminals (e.g., head-mounted display), we further embed spiking neural networks (SNNs) into the PFL. By exploiting sparse, event-driven spike computation, the SNN-enabled PFL reduces the computation and energy cost of training and inference while maintaining competitive personalization performance. Experiments on real brain-signal dataset demonstrate that our method achieves the best overall identification accuracy while reducing inference energy by 6.46$\times$ compared with conventional artificial neural network-based personalized baselines.

Keywords: brain-computer interface, immersive communication, personalized federated learning, spiking neural networks, energy efficiency, brain signals, on-device learning, privacy preservation
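
The "sparse, event-driven spike computation" mentioned in the abstract can be illustrated with a single leaky integrate-and-fire (LIF) neuron. This is a generic textbook sketch, not the paper's model: the abstract does not name the neuron type, and the decay factor, threshold, and input sequence are assumptions.

```python
def lif_run(inputs, decay=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron; return the binary spike train."""
    v = 0.0
    spikes = []
    for current in inputs:
        v = decay * v + current          # leak the membrane potential, then integrate
        if v >= threshold:
            spikes.append(1)
            v = 0.0                      # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

inputs = [0.3, 0.3, 0.3, 0.3, 0.0, 0.0, 0.9, 0.2]
spikes = lif_run(inputs)
active_fraction = sum(spikes) / len(spikes)
```

Here the neuron fires on only two of the eight time steps, so downstream layers would have work to do on a quarter of the steps; that sparsity is the usual source of the energy savings attributed to SNNs.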

282. ❌ Double Coupling Architecture and Training Method for Optimization Problems of Differential Algebraic Equations with Parameters

Authors: Wenqiang Yang, Wenyuan Wu, Yong Feng, Changbo Chen Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22724v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

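Scoring rationale: The paper studies a dual physics-informed neural network architecture and training method for optimization problems of parametric differential algebraic equations, within scientific computing and engineering optimization. Its core techniques are physics-informed neural networks (PINNs) and genetic algorithms, which are unrelated to the vast majority of large-model and deep learning principles in the keyword list (such as LLM, MoE, RLHF, and RAG). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI (specifically PINNs) to scientific simulation and engineering optimization, which falls under AI for Science in a broad sense. This is not the paper's core contribution (which is the specific architecture and training method), hence 5 points (some relevance). No other keyword appears in the title or abstract or relates to the paper's topic, so each receives 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a dual physics-informed neural network architecture and a genetic-algorithm-enhanced training framework for optimization problems of parametric differential algebraic equations. It decouples constraints from objective functions and improves training precision and efficiency for multi-task optimization.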

Abstract

Simulation and modeling are essential in product development, integrated into the design and manufacturing process to enhance efficiency and quality. They are typically represented as complex nonlinear differential algebraic equations. The growing diversity of product requirements demands multi-task optimization, a key challenge in simulation modeling research. A dual physics-informed neural network architecture has been proposed to decouple constraints and objective functions in parametric differential algebraic equation optimization problems. Theoretical analysis shows that introducing a relaxation variable with a global error bound ensures solution equivalence between the network and optimization problem. A genetic algorithm-enhanced training framework for physics-informed neural networks improves training precision and efficiency, avoiding redundant solving of differential algebraic equations. This approach enables generalization for multi-task objectives with a single training, maintaining real-time responsiveness to product requirements.

Keywords: differential algebraic equations, optimization problems, physics-informed neural networks, genetic algorithm, multi-task optimization, parameter decoupling, training framework, simulation modeling

283. ❌ Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints

Authors: Tian Xu, Chenyang Wang, Xiaochen Zhai, Ziniu Li, Yi-Chen Li, Yang Yu Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

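Scoring rationale: The paper focuses on imitation learning within reinforcement learning, specifically non-adversarial Q-based methods, and studies how propagating Q-values through Bellman constraints reduces compounding errors. The scoring keywords all concern large language models, deep-learning techniques, or AI-for-science applications, whereas this paper's core is an imitation learning algorithm, so none of them is directly relevant.

!!! tip deepseek-chat TL;DR

The paper shows that the existing non-adversarial imitation learning method IQ-Learn in fact degenerates to behavioral cloning and suffers from compounding errors. It proposes Dual Q-DM, a new method based on Bellman constraints that is theoretically guaranteed to eliminate compounding errors and can recover expert actions beyond the demonstrations.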

Abstract (Chinese translation)

对抗模仿学习（Adversarial Imitation Learning, AIL）通过减轻行为克隆（Behavioral Cloning, BC）中的复合误差实现了高质量的模仿，但由于对抗性优化，其训练过程常表现出不稳定性。为避免此问题，一类以IQ-Learn为代表的非对抗性基于Q值的模仿学习方法应运而生，并被广泛认为能通过利用在线环境交互来超越BC。然而，本文重新审视IQ-Learn并证明，该方法在理论上可简化为BC，且其模仿差距下界与任务步长呈二次方依赖关系，因此仍受复合误差影响。理论分析表明，尽管使用了在线交互，IQ-Learn对示范数据未覆盖的状态上的所有动作的Q值进行了均匀抑制，因而无法实现泛化。为解决这一局限，我们引入了一种用于分布匹配的对偶框架，从而提出了一种新的基于Q值的模仿学习方法——对偶Q分布匹配。对偶Q分布匹配的核心机制是通过引入贝尔曼约束，将已访问状态的高Q值传播至未访问状态，从而实现超越示范数据的泛化。我们证明对偶Q分布匹配等价于AIL，且能够恢复示范数据之外的专家动作，从而减轻复合误差。据我们所知，对偶Q分布匹配是首个在理论上保证能消除复合误差的非对抗性模仿学习方法。实验结果进一步验证了我们的理论结论。

Abstract

Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.

Keywords: Imitation Learning, Compounding Errors, Bellman Constraints, Q-based Methods, Non-adversarial Learning, Distribution Matching, IQ-Learn, Dual Q-DM
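
The "quadratic dependence on horizon" mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes a hypothetical model in which the learner leaves the expert's state distribution with probability `eps` at each step, never recovers, and pays cost 1 for every off-track step. The function name and the model itself are illustrative assumptions.

```python
def expected_mistakes(horizon: int, eps: float) -> float:
    """Expected number of off-track steps in a toy 'no recovery' model.

    Assumption (hypothetical, not from the paper): at every step the learner
    leaves the expert's states with probability eps, never recovers, and pays
    cost 1 for each step spent off-track.
    """
    total = 0.0
    p_on_track = 1.0
    for _ in range(horizon):
        p_on_track *= 1.0 - eps      # still on expert states after this step
        total += 1.0 - p_on_track    # probability of being off-track now
    return total


if __name__ == "__main__":
    for h in (10, 20, 40):
        print(h, round(expected_mistakes(h, 0.01), 4))
```

When `eps * horizon` is small, the total behaves like `eps * horizon**2 / 2`, so doubling the horizon raises the cost by a factor close to four rather than two. That is the compounding-error pattern the abstract attributes to behavioral cloning.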

284. ❌ Coordinate Encoding on Linear Grids for Physics-Informed Neural Networks

Authors: Tetsuro Tsuchino, Motoki Shiga Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22700v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

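Scoring rationale: The paper applies physics-informed neural networks (PINNs) to solving partial differential equations (PDEs) and proposes a coordinate encoding on linear grid cells to address slow training convergence. The keywords all target large models (LLMs) or innovations in deep-learning techniques, whereas this paper applies conventional deep neural networks to scientific computing and does not touch on large models, MoE, scaling laws, pre-training, alignment, inference acceleration, agents, or similar topics. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (computational physics and engineering), but not to bioinformatics or cheminformatics, so it receives 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the slow training convergence of physics-informed neural networks (PINNs), the paper proposes a coordinate encoding on linear grid cells. By separating local domains and interpolating with natural cubic splines, it speeds up training convergence and lowers computational cost.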

Abstract (Chinese translation)

在求解偏微分方程时，利用物理定律的机器学习方法因具备无网格求解、无监督学习及适用于高维问题等优势而受到广泛关注。其中一种有效方法基于物理信息神经网络，该网络依托于深度神经网络，后者在众多学术与工业应用中表现出卓越性能。然而，由于频谱偏差问题导致收敛速度显著缓慢，物理信息神经网络在模型训练中面临困难。本研究提出一种基于物理信息神经网络的方法，该方法在线性网格单元上配备了坐标编码层。所提出的方法通过网格单元分离局部域，提升了训练收敛速度。此外，通过采用与坐标轴无关的线性网格单元，降低了整体计算成本。该方法还通过使用自然三次样条在网格点之间对编码坐标进行适当插值，实现了高效稳定的模型训练，这保证了为损失函数计算的模型导数函数的连续性。数值实验结果表明，所提方法具有优异的性能表现和高效的训练收敛速度。

Abstract

In solving partial differential equations (PDEs), machine learning utilizing physical laws has received considerable attention owing to advantages such as mesh-free solutions, unsupervised learning, and feasibility for solving high-dimensional problems. An effective approach is based on physics-informed neural networks (PINNs), which are based on deep neural networks known for their excellent performance in various academic and industrial applications. However, PINNs struggled with model training owing to significantly slow convergence because of a spectral bias problem. In this study, we propose a PINN-based method equipped with a coordinate-encoding layer on linear grid cells. The proposed method improves the training convergence speed by separating the local domains using grid cells. Moreover, it reduces the overall computational cost by using axis-independent linear grid cells. The method also achieves efficient and stable model training by adequately interpolating the encoded coordinates between grid points using natural cubic splines, which guarantees continuous derivative functions of the model computed for the loss functions. The results of numerical experiments demonstrate the effective performance and efficient training convergence speed of the proposed method.

Keywords: Physics-Informed Neural Networks, PINNs, partial differential equations, coordinate encoding, linear grid cells, training convergence, spectral bias, natural cubic splines
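
The abstract relies on natural cubic splines to interpolate encoded coordinates between grid points so that the model's derivatives stay continuous. As a rough illustration of that building block only, here is a minimal one-dimensional natural cubic spline in pure Python. It is a textbook construction, not the paper's encoding layer; the function name and the sample grid values are assumptions made for the example.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a callable natural cubic spline through the points (xs[i], ys[i]).

    xs must be strictly increasing with at least two entries. 'Natural' means
    the second derivative is zero at both end points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the quadratic coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * (ys[i + 1] - ys[i]) / h[i] - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal solve.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution for the per-cell coefficients b, c, d.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the grid cell containing x (clamped to the outer cells).
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        t = x - xs[j]
        return ys[j] + b[j] * t + c[j] * t * t + d[j] * t ** 3

    return evaluate


if __name__ == "__main__":
    encode = natural_cubic_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
    print(encode(0.5), encode(1.0), encode(1.5))
```

Because adjacent cubic pieces share first and second derivatives at each grid point, a loss that differentiates the interpolated value with respect to the input coordinate sees no jumps at cell boundaries, which piecewise-linear interpolation would introduce.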

285. ❌ Vision-based Deep Learning Analysis of Unordered Biomedical Tabular Datasets via Optimal Spatial Cartography

Authors: Sakib Mostafa, Tarik Massoud, Maximilian Diehn, Lei Xing, Md Tauhidul Islam Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

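Scoring rationale: The paper develops Dynomap, a deep learning framework that converts unordered biomedical tabular data into optimized spatial feature maps so that vision models can process them effectively. Its core is the application of deep learning to biomedical data analysis, in particular feature mapping and prediction for tabular data. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant, because the paper explicitly involves bioinformatics (for example liquid biopsy and transcriptomics) and the application of AI to science (biomedicine). The other keywords mainly concern technical details of large language models (LLMs), training methods, inference optimization, agent systems, and so on, none of which the paper mentions or discusses, so they score 0. The paper's contribution is spatial representation learning for tabular data rather than large-model technology itself.

!!! tip deepseek-chat TL;DR

The paper proposes Dynomap, a deep learning framework that uses dynamic feature mapping to convert unordered biomedical tabular data into optimized spatial feature maps, improving the accuracy and interpretability of vision models on tasks such as cancer subtype prediction and Parkinson's disease voice analysis.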

Abstract (Chinese translation)

表格数据是生物医学研究的核心，涵盖液体活检、批量和单细胞转录组学、电子健康记录以及表型分析等领域。然而，与图像或序列数据不同，表格数据集缺乏内在的空间组织特征：数据特征被视为无序维度，其相互关系必须由模型隐式推断。这限制了视觉架构在非空间生物医学数据中利用局部结构及高阶特征交互的能力。本文提出动态特征映射（Dynomap），一种端到端的深度学习框架，能够直接从数据中学习任务优化的特征空间拓扑结构。Dynomap通过完全可微的渲染机制，联合优化特征布局与预测任务，无需依赖启发式规则、预定义分组或外部先验知识。通过将高维表格向量转化为学习得到的特征图，Dynomap使基于视觉的模型能够有效处理无序的生物医学输入数据。在多个临床与生物数据集中，Dynomap的表现始终优于经典机器学习方法、现代深度表格模型及现有向量转图像方法。在液体活检数据中，Dynomap将临床相关的基因特征组织为连贯的空间模式，并将多类癌症亚型预测准确率最高提升18%。在帕金森病语音数据集中，该框架聚类了疾病相关的声学描述符，并将准确率最高提升8%。在其他生物医学数据集中同样观察到类似的性能提升与可解释的特征组织模式。这些结果表明，Dynomap作为一种通用策略，能够有效连接表格数据与基于视觉的深度学习，并在高维生物医学数据中揭示具有临床意义的结构化模式。

Abstract

Tabular data are central to biomedical research, from liquid biopsy and bulk and single-cell transcriptomics to electronic health records and phenotypic profiling. Unlike images or sequences, however, tabular datasets lack intrinsic spatial organization: features are treated as unordered dimensions, and their relationships must be inferred implicitly by the model. This limits the ability of vision architectures to exploit local structure and higher-order feature interactions in non-spatial biomedical data. Here we introduce Dynamic Feature Mapping (Dynomap), an end-to-end deep learning framework that learns a task-optimized spatial topology of features directly from data. Dynomap jointly optimizes feature placement and prediction through a fully differentiable rendering mechanism, without relying on heuristics, predefined groupings, or external priors. By transforming high-dimensional tabular vectors into learned feature maps, Dynomap enables vision-based models to operate effectively on unordered biomedical inputs. Across multiple clinical and biological datasets, Dynomap consistently outperformed classical machine learning, modern deep tabular models, and existing vector-to-image approaches. In liquid biopsy data, Dynomap organized clinically relevant gene signatures into coherent spatial patterns and improved multiclass cancer subtype prediction accuracy by up to 18%. In a Parkinson disease voice dataset, it clustered disease-associated acoustic descriptors and improved accuracy by up to 8%. Similar gains and interpretable feature organization were observed in additional biomedical datasets. These results establish Dynomap as a general strategy for bridging tabular and vision-based deep learning and for uncovering structured, clinically relevant patterns in high-dimensional biomedical data.

Keywords: Dynamic Feature Mapping, Dynomap, tabular data, biomedical research, deep learning, vision-based models, feature organization, clinical prediction

286. ❌ Bounding Box Anomaly Scoring for simple and efficient Out-of-Distribution detection

Authors: Mohamed Bahi Yahiaoui, Geoffrey Daniel, Loïc Giraldi, Jérémie Bruyelle, Julyan Arbel Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

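Scoring rationale: The paper focuses on out-of-distribution detection in computer vision and proposes a post-hoc method based on bounding-box abstraction. Although it involves deep learning, its subject is unrelated to every scoring keyword (all of which center on large language model techniques, training methods, inference optimization, alignment, scientific applications, and so on), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes BBAS, an out-of-distribution detection method based on bounding-box abstraction. Using monitoring variables adapted to convolutional layers and multi-layer representations, it achieves robust separation of in-distribution and out-of-distribution samples on image-classification benchmarks.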

Abstract (Chinese translation)

分布外（Out-of-distribution, OOD）检测旨在识别与训练分布不同的输入，以减少深度神经网络产生不可靠预测的风险。在基于特征空间的后处理方法中，OOD检测通常通过在预训练网络的表示空间中近似分布内支持来实现。现有方法往往需要在紧凑的参数化模型（如基于马氏距离的评分）与更灵活但依赖参考样本的方法（如k近邻）之间进行权衡。边界框抽象提供了一种有吸引力的中间视角，它通过对隐藏激活进行紧凑的轴对齐概括来表示分布内支持。本文提出边界框异常评分（Bounding Box Anomaly Scoring, BBAS），这是一种利用边界框抽象的后处理OOD检测方法。BBAS结合了基于区间超出的分级异常评分、适用于卷积层的监控变量，以及解耦的聚类与边界框构建机制，从而获得更丰富且多层级的表示。在图像分类基准测试上的实验表明，BBAS能够稳健地区分分布内与分布外样本，同时保持了边界框方法的简洁性、紧凑性和可更新性。

Abstract

Out-of-distribution (OOD) detection aims to identify inputs that differ from the training distribution in order to reduce unreliable predictions by deep neural networks. Among post-hoc feature-space approaches, OOD detection is commonly performed by approximating the in-distribution support in the representation space of a pretrained network. Existing methods often reflect a trade-off between compact parametric models, such as Mahalanobis-based scores, and more flexible but reference-based methods, such as k-nearest neighbors. Bounding-box abstraction provides an attractive intermediate perspective by representing in-distribution support through compact axis-aligned summaries of hidden activations. In this paper, we introduce Bounding Box Anomaly Scoring (BBAS), a post-hoc OOD detection method that leverages bounding-box abstraction. BBAS combines graded anomaly scores based on interval exceedances, monitoring variables adapted to convolutional layers, and decoupled clustering and box construction for richer and multi-layer representations. Experiments on image-classification benchmarks show that BBAS provides robust separation between in-distribution and out-of-distribution samples while preserving the simplicity, compactness, and updateability of the bounding-box approach.

Keywords: Out-of-distribution detection, Bounding-box abstraction, Post-hoc method, Anomaly scoring, Deep neural networks, Image classification, Convolutional layers, Representation space
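
The two ideas the abstract names, an axis-aligned box summarizing in-distribution activations and a graded score based on interval exceedances, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the BBAS method itself: it uses a single box with no clustering, treats each input as a plain activation vector, and sums raw exceedances without normalization. `fit_box` and `anomaly_score` are hypothetical names.

```python
def fit_box(activations):
    """Per-dimension [min, max] interval over in-distribution activation vectors."""
    dims = range(len(activations[0]))
    lows = [min(a[k] for a in activations) for k in dims]
    highs = [max(a[k] for a in activations) for k in dims]
    return lows, highs


def anomaly_score(box, x):
    """Sum of per-dimension exceedances; 0.0 when x lies inside the box."""
    lows, highs = box
    score = 0.0
    for lo, hi, v in zip(lows, highs, x):
        if v < lo:
            score += lo - v      # distance below the lower bound
        elif v > hi:
            score += v - hi      # distance above the upper bound
    return score


if __name__ == "__main__":
    box = fit_box([[0.0, 0.0], [1.0, 2.0], [0.5, 1.0]])
    print(anomaly_score(box, [0.5, 1.0]))   # inside the box
    print(anomaly_score(box, [2.0, 3.0]))   # outside in both dimensions
```

A graded score of this kind distinguishes a sample that barely leaves the box from one that is far outside it, which a binary inside/outside test cannot do. The box can also be updated with new in-distribution data by widening the stored minima and maxima.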

287. ❌ Generalizing Dynamics Modeling More Easily from Representation Perspective

Authors: Yiming Wang, Zhengnan Zhang, Genghe Zhang, Jiawen Dan, Changchun Li, Chenlong Hu, Chris Nugent, Jun Liu, Ximing Li, Bo Yang Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

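Scoring rationale: The paper proposes PDEDER, which uses a pre-trained language model (PLM) to learn the dynamics of complex systems. It centers on pre-training (keyword 5, 10 points) and fine-tuning (keyword 6, 5 points) and is an AI for Science application (keyword 27, 10 points). World Models is somewhat relevant (keyword 24, 5 points) because the paper involves dynamics modeling. The other keywords (LLM, MoE, reasoning methods, and so on) are not mentioned in the abstract or are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a generalized Pre-trained Dynamics EncoDER (PDEDER) that uses a pre-trained language model to embed observations into a latent space where the dynamics of complex systems can be modeled more effectively. It validates the method's effectiveness and generalizability on multiple real-world and synthetic systems.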

Abstract (Chinese translation)

从观测数据中学习系统动力学是气候、生态、流体等诸多现实世界复杂系统应用中的关键问题。近年来，神经动力学建模方法已成为一种主流解决方案，其先将对象的观测数据嵌入到潜在空间，再使用神经常微分方程（Neural Ordinary Differential Equations, ODE）等神经方法学习动力学规律。现有的动力学建模方法通常针对不同复杂系统的每次观测分别构建特定模型，导致其跨系统泛化能力较差。受预训练模型巨大成功的启发，我们提出了一种通用的预训练动力学编码器（Pre-trained Dynamics EncoDER, PDEDER），它能够将原始状态观测数据嵌入到一个潜在空间，在该空间中动力学规律更易于捕捉。为实现通用的PDEDER，我们通过最小化李雅普诺夫指数（Lyapunov exponent）目标来预训练任意预训练语言模型（Pre-trained Language Model, PLM），该目标约束了潜在空间中所学主导动力学行为的混沌特性。通过惩罚嵌入观测数据的发散性，我们的PDEDER促进了局部稳定且结构良好的潜在动力学，从而比在原始观测空间中更有效地进行动力学建模。此外，我们结合了重构与预测目标，以降低获得过度平滑潜在空间的风险。具体而言，我们从23个复杂系统中收集了152组真实世界与合成观测数据作为预训练语料，并用其预训练PDEDER。对于任何未来的动态观测数据，我们均可使用特定的动力学建模方法对PDEDER进行微调。我们在12个动态系统上，通过域内与跨域设置下的短期/长期预测任务评估PDEDER，实证结果验证了PDEDER的有效性与泛化能力。

Abstract

Learning system dynamics from observations is a critical problem in many applications over various real-world complex systems, e.g., climate, ecology, and fluid systems. Recently, neural dynamics modeling methods have become a prevalent solution that embeds the object’s observations into a latent space before learning dynamics using neural methods such as neural Ordinary Differential Equations (ODE). Existing dynamics modeling methods induce a specific model for each observation of different complex systems, resulting in poor generalization across systems. Inspired by the great success of pre-trained models, we conduct a generalized Pre-trained Dynamics EncoDER (PDEDER) which can embed the original state observations into a latent space where the dynamics can be captured more easily. To conduct the generalized PDEDER, we pre-train any Pre-trained Language Model (PLM) by minimizing the Lyapunov exponent objective, which constrains the chaotic behavior of governing dynamics learned in the latent space. By penalizing the divergence of embedded observations, our PDEDER promotes locally stable and well-structured latent dynamics, thereby facilitating more effective dynamics modeling than in the original observation space. In addition, we incorporate reconstruction and forecasting objectives to mitigate the risk of obtaining an over-smoothed latent space. Specifically, we collect 152 sets of real-world and synthetic observations from 23 complex systems as pre-training corpora and employ them to pre-train PDEDER. Given any future dynamic observation, we can fine-tune PDEDER with any specific dynamics modeling method. We evaluate PDEDER on 12 dynamic systems by short/long-term forecasting under both in-domain and cross-domain settings, and the empirical results indicate the effectiveness and generalizability of PDEDER.

Keywords: dynamics modeling, pre-trained language model, latent space, neural ODE, generalization, complex systems, Lyapunov exponent, forecasting
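
For readers unfamiliar with the quantity behind the "Lyapunov exponent objective", the sketch below estimates the Lyapunov exponent of the one-dimensional logistic map. It only illustrates what the exponent measures (the average exponential rate at which nearby trajectories diverge). It is not the paper's pre-training loss, and the choice of the logistic map and all parameter values are assumptions for the example.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=100, steps=5000):
    """Estimate the Lyapunov exponent of the logistic map x -> r * x * (1 - x).

    The estimate is the orbit average of log|f'(x)| with f'(x) = r * (1 - 2x).
    """
    x = x0
    for _ in range(burn_in):                     # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-12))     # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps


if __name__ == "__main__":
    print("r = 2.5:", logistic_lyapunov(2.5))    # stable fixed point
    print("r = 3.9:", logistic_lyapunov(3.9))    # chaotic regime
```

A negative exponent means nearby trajectories converge (locally stable dynamics), while a positive one indicates chaotic divergence. Penalizing divergence in a latent space, as the abstract describes, pushes the learned dynamics toward the former regime.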

288. ❌ Transfer learning via interpolating structures

Authors: T. A. Dardeno, A. J. Hughes, L. A. Bull, R. S. Mills, N. Dervilis, K. Worden Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22621v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

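Scoring rationale: The paper studies transfer learning for structural health monitoring, transferring knowledge between heterogeneous structures by parameterizing structural variation. It is a machine learning application in traditional engineering and has no direct connection to any of the keywords on large models or deep-learning techniques, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper studies how knowledge transfer in heterogeneous structural health monitoring can be achieved through parameterized intermediate structures, and shows that positive transfer is possible even between highly disparate systems.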

Abstract (Chinese translation)

尽管基于群体的结构健康监测（Population-based Structural Health Monitoring, PBSHM）近期取得了进展，但在高度异构结构（即异质群体）间的知识迁移仍是一个挑战。当前研究提出，异质迁移可通过中间结构实现，这些中间结构能够弥合目标结构间的信息差距。该技术的一个关键思想是，通过改变材料属性与几何形状等参数，一个结构可被连续变形为另一个结构。此方法通过案例研究进行了验证：案例1涉及模拟异质桥梁设计的参数化（及其间迁移）；案例2则通过一系列有限元模型，展示了简化物理表示的“桥梁”与“飞机”之间的迁移。此前在基于结构相似性预测正向迁移的讨论中，曾出现一个诙谐问题：“桥梁何时不是飞机？”虽然其明显答案是“永远不是”，但本文研究结果表明，在某些情况下，高度异构系统间确实能够实现正向迁移。

Abstract

Despite recent advances in population-based structural health monitoring (PBSHM), knowledge transfer between highly-disparate structures (i.e., heterogeneous populations) remains a challenge. The current work proposes that heterogeneous transfer may be accomplished via intermediate structures that bridge the gap in information between the structures of interest. A key aspect of the technique is the idea that by varying parameters such as material properties and geometry, one structure can be continuously morphed into another. The approach is demonstrated via a case study involving the parameterisation of (and transfer between) simulated heterogeneous bridge designs (Case 1). Transfer between simplified physical representations of a ‘bridge’ and ‘aeroplane’ is then demonstrated in Case 2, via a chain of finite-element models. The facetious question ‘When is a bridge not an aeroplane?’ has been previously asked in the context of predicting positive transfer based on structural similarity. While the obvious answer to this question is ‘Always,’ the results presented in the current paper show that, in some cases, positive transfer can indeed be achieved between highly-disparate systems.

Keywords: transfer learning, structural health monitoring, heterogeneous populations, parameterization, finite-element models, knowledge transfer, intermediate structures

289. ❌ Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification

Authors: Xiaohan Zhu, Mesrob I. Ohannessian, Nathan Srebro Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22644v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

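Scoring rationale: The paper studies the theoretical properties of PAC-Bayes learning rules for binary classification. It belongs to classical machine learning theory and is unrelated to every keyword (all of which concern large models, deep-learning techniques, and their applications). It does not touch on large-model architectures, training methods, inference optimization, alignment techniques, scientific applications, or other modern deep-learning topics.

!!! tip deepseek-chat TL;DR

The paper studies overfitting and generalization of PAC-Bayes learning rules in noisy binary classification. It finds that the Bayesian predictor can incur non-vanishing excess loss in the agnostic case, whereas choosing the balancing parameter $λ\gg 1$ ensures uniformly vanishing excess loss.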

Abstract (Chinese translation)

我们考虑一种用于二分类的PAC-Bayes类型学习规则，该规则在随机“后验”预测器的训练误差与其到预先指定的“先验”的KL散度之间进行权衡。这可以视为一种改进的两部分编码最小描述长度（MDL）学习规则向连续先验和随机预测的扩展。当平衡参数$λ=1$时，该学习规则恢复为（经验）贝叶斯后验；其一种改进变体则恢复为轮廓后验，从而与标准贝叶斯预测相联系（除对单参数噪声水平的处理方式外）。然而，从风险最小化预测的角度看，这种贝叶斯预测器存在过拟合问题，在不可知情形下可能导致非零的额外损失。相反，选择$λ\gg 1$（可视为使用依赖于样本量的先验）能确保即使在不可知情形下，额外损失也能一致地趋近于零。我们精确刻画了欠正则化（与过正则化）作为平衡参数$λ$函数的影响，理解了欠正则化被缓和或导致灾难性后果的不同机制。本研究扩展了Zhu与Srebro [2025]仅考虑离散先验的工作，将其推广至PAC-Bayes类型学习规则，并通过其严格的贝叶斯解释，更广泛地应用于贝叶斯预测领域。

Abstract

We consider a PAC-Bayes type learning rule for binary classification, balancing the training error of a randomized “posterior” predictor with its KL divergence to a pre-specified “prior”. This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule, to continuous priors and randomized predictions. With a balancing parameter of $λ=1$ this learning rule recovers an (empirical) Bayes posterior and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead, a choice of $λ\gg 1$, which can be seen as using a sample-size-dependent prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter $λ$, understanding the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025] that considered only discrete priors to PAC-Bayes type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.

Keywords: PAC-Bayes, binary classification, overfitting, generalization, Bayesian prediction, regularization, excess loss, agnostic learning
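
The trade-off between training error and KL divergence to a prior has a well-known closed-form solution on a finite hypothesis set, which the sketch below computes. It assumes a generic objective of the form E_Q[loss] + λ · KL(Q ‖ prior), with `nll` standing for each hypothesis's total negative log-likelihood on the training data. The paper's exact rule, its noise model, and its scaling of λ may differ, so treat the function and its parameters as illustrative assumptions.

```python
import math


def gibbs_posterior(prior, nll, lam):
    """Distribution Q(h) proportional to prior[h] * exp(-nll[h] / lam).

    Q minimizes  E_Q[nll] + lam * KL(Q || prior)  over distributions on a
    finite hypothesis set (all prior entries must be positive). With lam = 1
    and nll the negative log-likelihood of the data, Q is the Bayes posterior.
    """
    log_w = [math.log(p) - loss / lam for p, loss in zip(prior, nll)]
    m = max(log_w)                               # log-sum-exp for stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


if __name__ == "__main__":
    prior = [0.5, 0.5]
    nll = [1.0, 3.0]                             # hypothetical training losses
    for lam in (1.0, 4.0, 100.0):
        print(lam, gibbs_posterior(prior, nll, lam))
```

Under this assumed form, raising λ pulls the posterior back toward the prior, which is the sense in which $λ\gg 1$ acts as stronger regularization than the Bayes posterior obtained at $λ=1$.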

290. ❌ Causal Discovery in Action: Learning Chain-Reaction Mechanisms from Interventions

Authors: Panayiotis Panayiotou, Özgür Şimşek Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

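Scoring rationale: The paper focuses on the theory and methods of causal discovery, specifically the identification of causal graphs in chain-reaction systems. Although it falls broadly under AI for Science, its content does not touch any of the specific techniques covered by the scoring keywords, such as large models, deep learning principles, training optimization, inference acceleration, alignment, or agent systems. Its core is the mathematical theory and algorithm design of causal inference, which has no direct connection to the large-model techniques in the keyword list.

!!! tip deepseek-chat TL;DR

This paper studies how blocking interventions can uniquely identify the causal graph in dynamical systems with chain-reaction structure, and proposes a minimal estimator with finite-sample guarantees that achieves reliable causal recovery on synthetic models and diverse environments.

Abstract translation (Chinese)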

因果发现在一般动态系统中具有挑战性，因为若无强结构假设，即使基于干预数据，底层因果图也可能无法被识别。然而，许多现实世界系统表现出定向的、级联式的结构，其中各组件依次激活，且上游故障会抑制下游效应。我们研究此类链式反应系统中的因果发现，并证明通过阻断个体组件激活的干预措施，因果结构可被唯一识别。我们提出一种具有有限样本保证的最小估计器，实现了指数级误差衰减与对数级样本复杂度。在合成模型与多样化的链式反应环境中的实验表明，仅需少量干预即可实现可靠恢复，而观测性启发式方法在因果效应存在延迟或重叠的情况下则会失效。

Abstract

Causal discovery is challenging in general dynamical systems because, without strong structural assumptions, the underlying causal graph may not be identifiable even from interventional data. However, many real-world systems exhibit directional, cascade-like structure, in which components activate sequentially and upstream failures suppress downstream effects. We study causal discovery in such chain-reaction systems and show that the causal structure is uniquely identifiable from blocking interventions that prevent individual components from activating. We propose a minimal estimator with finite-sample guarantees, achieving exponential error decay and logarithmic sample complexity. Experiments on synthetic models and diverse chain-reaction environments demonstrate reliable recovery from a few interventions, while observational heuristics fail in regimes with delayed or overlapping causal effects.

Keywords: causal discovery, chain-reaction systems, interventional data, identifiability, blocking interventions, finite-sample guarantees, causal graph, dynamical systems
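
As a rough illustration of why blocking interventions are informative in a chain-reaction system, the sketch below simulates a noise-free toy cascade and reads the structure off the set of components that each block suppresses. It is not the paper's estimator: the single-parent (tree) structure, the deterministic activation rule, the component names, and the functions `simulate` and `recover_parents` are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional, Set

Parents = Dict[str, Optional[str]]


def simulate(parents: Parents, blocked: Optional[str] = None) -> Set[str]:
    """Return the components that activate in one noise-free run.

    Assumed toy rule: a component fires only if it is not blocked and its
    single parent (if it has one) fired.
    """
    def fires(node: str) -> bool:
        if node == blocked:
            return False
        parent = parents[node]
        return parent is None or fires(parent)

    return {node for node in parents if fires(node)}


def recover_parents(nodes: List[str],
                    run: Callable[[Optional[str]], Set[str]]) -> Parents:
    """Infer each component's direct parent from one blocking run per component."""
    baseline = run(None)
    # Components that stop activating when `b` is blocked (excluding `b` itself).
    suppressed = {b: baseline - run(b) - {b} for b in nodes}
    recovered: Parents = {}
    for node in nodes:
        ancestors = [a for a in nodes if node in suppressed[a]]
        # In a tree, the nearest ancestor suppresses the fewest components.
        recovered[node] = min(ancestors, key=lambda a: len(suppressed[a]),
                              default=None)
    return recovered


# Hypothetical four-component cascade used only for illustration.
truth: Parents = {"spark": None, "fuse": "spark", "charge": "fuse", "alarm": "spark"}
estimate = recover_parents(list(truth), lambda b: simulate(truth, blocked=b))
print(estimate == truth)
```

With noisy or delayed activations, each blocking experiment would presumably need to be repeated and combined with a statistical decision rule, which is where finite-sample guarantees of the kind described in the abstract become relevant.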

291. ❌ Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks

Authors: Matías Pizarro, Raghavan Narasimhan, Asja Fischer Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22590v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

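Scoring rationale: The paper studies the adversarial robustness of automatic speech recognition (ASR) systems, defending against adversarial attacks by varying the inference precision. All keywords concern large models, deep learning principles, or AI-for-science applications, whereas this paper focuses on adversarial defense for conventional ASR models (not large language models) and does not touch any technique or application area in the keyword list, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes randomly sampling the inference precision to strengthen the adversarial robustness of automatic speech recognition systems, and shows that the approach significantly improves robustness and achieves competitive adversarial example detection performance.

Abstract translation (Chinese)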

随着自动化和智能体系统的日益普及，确保自动语音识别（ASR）模型的对抗鲁棒性变得至关重要。我们观察到，在推理过程中改变ASR模型的精度会降低对抗攻击成功的可能性。利用这一现象，我们通过在预测时对精度进行简单随机采样来增强模型的鲁棒性。此外，这一思路可转化为一种对抗样本检测策略，即通过比较不同精度下的输出结果，并利用简单的高斯分类器进行分析。实验分析表明，该方法能显著提升多种ASR模型及攻击类型下的鲁棒性，并在检测性能上展现出竞争力。

Abstract

With the increasing deployment of automated and agentic systems, ensuring the adversarial robustness of automatic speech recognition (ASR) models has become critical. We observe that changing the precision of an ASR model during inference reduces the likelihood of adversarial attacks succeeding. We take advantage of this fact to make the models more robust by simple random sampling of the precision during prediction. Moreover, the insight can be turned into an adversarial example detection strategy by comparing outputs resulting from different precisions and leveraging a simple Gaussian classifier. An experimental analysis demonstrates a significant increase in robustness and competitive detection performance for various ASR models and attack types.

Keywords: automatic speech recognition, adversarial robustness, precision-varying prediction, adversarial attacks, inference precision, adversarial example detection, ASR models
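
The two ideas in the abstract, sampling the precision at prediction time and comparing outputs across precisions, can be sketched on a toy model. The example below is only an illustration under stated assumptions: a tiny linear classifier stands in for an ASR model, precision is emulated by rounding through the standard-library `struct` module, and the detector simply reports a disagreement fraction, whereas the abstract describes a Gaussian classifier for detection. The names `round_to`, `predict`, `randomized_predict`, and `disagreement` are hypothetical.

```python
import random
import struct
from typing import Dict, List, Sequence

# struct format codes for IEEE 754 half, single, and double precision.
FORMATS: Dict[str, str] = {"fp16": "e", "fp32": "f", "fp64": "d"}


def round_to(value: float, precision: str) -> float:
    """Round a Python float to the nearest value representable at `precision`."""
    fmt = FORMATS[precision]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def predict(weights: Sequence[Sequence[float]], features: Sequence[float],
            precision: str) -> int:
    """Toy stand-in for a model: argmax of linear scores computed at `precision`."""
    scores: List[float] = []
    for row in weights:
        acc = 0.0
        for w, x in zip(row, features):
            product = round_to(w, precision) * round_to(x, precision)
            acc = round_to(acc + product, precision)
        scores.append(acc)
    return max(range(len(scores)), key=scores.__getitem__)


def randomized_predict(weights, features, rng: random.Random) -> int:
    """Defence sketch: draw the inference precision at random for each call."""
    return predict(weights, features, rng.choice(sorted(FORMATS)))


def disagreement(weights, features) -> float:
    """Detection sketch: fraction of precisions that deviate from the majority label."""
    labels = [predict(weights, features, p) for p in FORMATS]
    majority = max(set(labels), key=labels.count)
    return 1.0 - labels.count(majority) / len(labels)


weights = [[1.0, 0.0], [0.0, 1.0]]
clean = [2.0, 0.5]          # clear margin: every precision agrees
borderline = [1.0, 1.0004]  # tiny margin that fp16 rounding erases
print(disagreement(weights, clean), disagreement(weights, borderline))
```

The borderline input mimics a prediction that depends on a very small margin: the label changes once the precision changes, which is the kind of sensitivity the detection idea relies on.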

292. ❌ Subspace Tensor Orthogonal Rotation Model (STORM) for Batch Alignment, Cell Type Deconvolution, and Gene Imputation in Spatial Transcriptomic Data

Authors: Sean Cottrell, Guo-Wei Wei, Longxiu Huang Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

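Scoring rationale: The paper focuses on spatial transcriptomics data analysis and proposes a new tensor factorization method (STORM) for batch alignment, cell-type deconvolution, and gene imputation. All keywords except the last concern large models, deep learning principles, or related applications, whereas this paper uses a classical tensor factorization method and explicitly presents itself as standing “in contrast to black-box deep learning approaches”, so it is unrelated to those keywords. The last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”, receives 5 points: the paper belongs to bioinformatics and applies AI/computational methods to a scientific problem, but it is not a large-model or deep learning innovation, only an application of classical machine learning in a scientific domain, so the relevance is moderate.

!!! tip deepseek-chat TL;DR

This paper proposes a new interpretable tensor factorization model (STORM) to address batch effects, mixed cell-type signals, and gene expression imputation in spatial transcriptomics data, and reports state-of-the-art performance in experiments.

Abstract translation (Chinese)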

空间转录组数据分析将细胞转录活动与空间坐标整合，以识别空间域、推断细胞类型动态并表征组织内的基因表达模式。尽管近期取得进展，该领域仍存在显著挑战，包括批次效应的处理、混合细胞类型信号的处理以及测量不佳或缺失基因表达的插补。本研究通过提出一种新颖的子空间张量正交旋转模型（Subspace Tensor Orthogonal Rotation Model, STORM）应对这些挑战，该模型通过在物理模式或微环境层面考量，将空间维度和几何结构各异的多张切片进行对齐。为此，STORM提出了一种不规则张量分解技术，用于分解一系列基因表达矩阵并将其整合至共享的潜在空间以进行下游分析。相较于黑箱深度学习方法，所提出的模型具有内在可解释性。数值实验表明，该模型在空间转录组数据的纵向与横向批次整合、细胞类型反卷积以及未测量基因插补任务中均表现出最先进的性能。

Abstract

Spatial transcriptomics data analysis integrates cellular transcriptional activity with spatial coordinates to identify spatial domains, infer cell-type dynamics, and characterize gene expression patterns within tissues. Despite recent advances, significant challenges remain, including the treatment of batch effects, the handling of mixed cell-type signals, and the imputation of poorly measured or missing gene expression. This work addresses these challenges by introducing a novel Subspace Tensor Orthogonal Rotation Model (STORM) that aligns multiple slices which vary in their spatial dimensions and geometry by considering them at the level of physical patterns or microenvironments. To this end, STORM presents an irregular tensor factorization technique for decomposing a collection of gene expression matrices and integrating them into a shared latent space for downstream analysis. In contrast to black-box deep learning approaches, the proposed model is inherently interpretable. Numerical experiments demonstrate state-of-the-art performance in vertical and horizontal batch integration, cell-type deconvolution, and unmeasured gene imputation for spatial transcriptomics data.

Keywords: spatial transcriptomics, batch alignment, cell-type deconvolution, gene imputation, tensor factorization, interpretable model, irregular tensor, latent space

293. ❌ CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

Authors: Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21743v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

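Scoring rationale: The paper studies post-training a virtual cell model with reinforcement learning (RL) to improve its biological plausibility, an application of AI in bioinformatics. It is unrelated to most keywords (such as LLMs, MoE, and Scaling Laws), which mainly concern large language model techniques. Only two keywords are relevant: 1. “Post-training OR Supervised Fine-tuning OR SFT”: the paper post-trains the pretrained CellFlux model with RL, which counts as post-training but is not typical supervised fine-tuning, hence 5 points. 2. “AI for Science OR Bioinformatics OR Cheminformatics”: the paper applies AI directly to bioinformatics (virtual cell modeling), which is its core content, hence 10 points. Other keywords such as RLHF, PEFT, and RAG are not involved.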
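
To make the arithmetic behind this entry's score easier to follow, here is a minimal sketch of the scoring rule. The pass mark (5.0 + 0.8 × total weight) and the 15-point cap per keyword come from the configuration at the top of the report; how relevance is converted into a per-keyword score is not stated anywhere in the report, so the linear mapping in `keyword_score` is an assumption, as are all function names. The table above lists a weight of 0.0 for every keyword while the configuration lists 1.0, so the sketch evaluates both.

```python
from typing import Iterable, Tuple

MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report header


def pass_threshold(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED mapping: relevance (0-10) scaled linearly to the 15-point cap."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Relevance values from this entry's table: 5.0 and 10.0 for two keywords,
# 0.0 for the other 25.
relevances = [5.0, 10.0] + [0.0] * 25
as_listed = total_score((0.0, r) for r in relevances)      # weights as shown in the table
as_configured = total_score((1.0, r) for r in relevances)  # weights as shown in the header
print(as_listed, as_configured, round(pass_threshold(27.0), 1))
```

Under this assumed mapping the entry stays below the 26.6 pass mark with either set of weights, which is consistent with the ❌ verdict even though the exact formula used by the report is unknown.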

!!! tip deepseek-chat TL;DR

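This paper proposes post-training a virtual cell model with reinforcement learning, optimizing it with biologically constrained reward functions so that generated cell images better respect biophysical laws, thereby making virtual cell modeling more biologically meaningful.

Abstract translation (Chinese)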

利用生成模型构建虚拟细胞以在计算机中模拟细胞行为，正成为加速药物发现的一种新兴且有前景的研究范式。然而，现有的基于图像的生成方法可能产生违反基本物理和生物学约束的、不合理的细胞图像。为解决这一问题，我们提出使用强化学习对虚拟细胞模型进行后训练，将具有生物学意义的评估器作为奖励函数。我们设计了涵盖三个类别——生物功能、结构有效性和形态正确性——的七种奖励，并优化了最先进的CellFlux模型，从而得到CellFluxRL。在所有奖励指标上，CellFluxRL均持续优于CellFlux，且通过测试时缩放可进一步提升性能。总体而言，我们的研究成果提出了一个通过强化学习强制执行基于物理约束的虚拟细胞建模框架，推动生成结果从“视觉上真实”迈向“生物学上有意义”的新阶段。

Abstract

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond “visually realistic” generations towards “biologically meaningful” ones.

Keywords: virtual cell modeling, reinforcement learning, biological constraints, post-training, generative models, drug discovery, CellFlux, bioinformatics

294. ❌ Reaching for the performance limit of hybrid density functional theory for molecular chemistry

Authors: Jiashu Liang, Martin Head-Gordon Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

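Scoring rationale: The paper develops a new density functional theory (DFT) method (the COACH functional) and belongs to computational chemistry; it has no direct connection to any keyword on large models, deep learning, or AI technical principles. The only possibly relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper concerns computational methods for molecular chemistry, which fall broadly under scientific computing (AI for Science). However, the paper does not use AI or machine learning, relying instead on conventional numerical optimization and theoretical chemistry, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

This study proposes a systematic protocol for developing density functional approximations, yielding the COACH functional, which improves accuracy and transferability on molecular chemistry benchmarks while retaining computational practicality.

Abstract translation (Chinese)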

密度泛函理论（DFT）在精度与效率之间提供了出色的平衡，但实用的密度泛函近似在简洁性、准确性和可迁移性之间面临着不可避免的权衡。因此，需要一种系统化的方案来开发在选定应用领域内可靠且最精确的泛函。本文提出了一种结合约束条件强制、灵活泛函形式和现代优化技术的方案。将这一策略应用于范围分离杂化（RSH）meta-GGA框架，我们得到了经过精心优化且适当约束的杂化泛函（COACH）。在广泛的分子基准测试中，相较于包括ωB97M-V在内的主流RSH meta-GGA泛函，COACH在保持其计算层级实用性的同时，显著提升了准确性和可迁移性。最后，我们对剩余权衡与饱和行为的分析表明，进一步的系统性进展可能需要纳入真正的非局域信息。

Abstract

Density functional theory (DFT) offers an exceptional balance between accuracy and efficiency, but practical density functional approximations face an unavoidable trade-off among simplicity, accuracy, and transferability. A systematic protocol is therefore needed to develop functionals that are reliably most accurate within a chosen application domain. Here we present such a protocol by combining constraint enforcement, flexible functional forms, and modern optimization. Applying this strategy to the range-separated hybrid (RSH) meta-GGA framework, we obtain the carefully optimized and appropriately constrained hybrid (COACH) functional. Across broad molecular benchmarks, COACH improves both accuracy and transferability relative to leading RSH meta-GGAs, including ωB97M-V, while retaining the computational practicality of its rung. Finally, our analysis of the remaining trade-offs and saturation behavior suggests that further systematic progress will likely require the incorporation of genuinely nonlocal information.

Keywords: Density functional theory, Hybrid functional, Molecular chemistry, Accuracy, Transferability, Optimization, Range-separated hybrid, COACH functional

295. ❌ A unified variational framework for the inverse Kohn-Sham problem

Authors: Nan Sheng Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23452v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

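Scoring rationale: The paper studies the inverse Kohn-Sham problem in density functional theory and belongs to computational chemistry and quantum physics; it has no direct connection to any keyword on large models, deep learning, or AI techniques. The only possibly relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the work belongs to scientific computing. However, the paper involves no AI methods and is purely a mathematical-physics study of a theoretical framework, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

For the inverse Kohn-Sham problem in density functional theory, this paper proposes a unified variational framework that classifies and interprets several existing inversion methods (such as Wu-Yang and Zhao-Morrison-Parr) within a single theoretical structure.

Abstract translation (Chinese)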

逆科恩-沈（KS）问题旨在寻找一个局域有效势，其无相互作用基态能复现给定的电子密度。现有逆问题表述常以不同理论框架呈现，包括约化变分优化、惩罚正则化、基于响应的迭代法以及偏微分方程约束优化等。本研究通过两个步骤为逆KS理论构建了统一的变分框架。首先，我们指出精确密度泛函理论中嵌入的固定密度无相互作用约束搜索是逆KS问题的自然变分锚点。在此设定下，KS势作为与密度复现条件相关的变分对偶对象出现。其次，我们阐明了各类主流逆问题表述如何被理解为同一逆KS结构的不同实现形式，并依据其将KS态方程与密度复现条件处理为目标函数、约束条件、惩罚项或可行性关系的不同方式，纳入更广泛的优化理论分类体系。在此框架中，吴-杨方法表现为约化的精确乘子表述，赵-莫里森-帕尔方法对应二次惩罚松弛格式，而偏微分方程约束途径则属于显式态约束表述。该统一视角同样适用于增广拉格朗日法与全量残差表述，并澄清了不同逆方法中加性常数模糊性、渐近归一化、非光滑变分结构及弱能隙不稳定性等问题的角色。

Abstract

The inverse Kohn-Sham (KS) problem seeks a local effective potential whose noninteracting ground state reproduces a prescribed electron density. Existing inversion formulations are often expressed in disparate languages, including reduced variational optimization, penalty regularization, response-based iteration, and PDE-constrained optimization. In this work, we develop a unified variational framework for inverse KS theory in two steps. First, we identify the fixed-density noninteracting constrained search embedded in exact density functional theory as the natural variational anchor of inverse KS inversion. In this setting, the KS potential appears as the variational dual object associated with density reproduction. Second, we show how the principal inversion formulations may be understood as realizations of the same inverse-KS structure and how they fit into a broader optimization-theoretic classification according to whether the KS state equations and density-reproduction condition are treated as objectives, constraints, penalties, or feasibility relations. Within this framework, Wu-Yang appears as a reduced exact-multiplier formulation, Zhao-Morrison-Parr as a quadratic-penalty relaxation, and PDE-constrained approaches as explicit state-constraint formulations. The same viewpoint also accommodates augmented-Lagrangian and all-at-once residual formulations, and clarifies the roles of additive-constant ambiguity, asymptotic normalization, nonsmooth variational structure, and weak-gap instability across inversion methods.

Keywords: inverse Kohn-Sham problem, density functional theory, variational framework, electron density, KS potential, constrained search, optimization theory, PDE-constrained optimization
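
For orientation, the three ingredients named in the abstract (the noninteracting constrained search, the Wu-Yang exact-multiplier form, and the Zhao-Morrison-Parr quadratic penalty) are commonly written as below. These are the standard textbook forms for pure states, given here as an assumption about what the abstract refers to; the paper's own notation and its ensemble generalizations may differ. Here $\rho_{\mathrm{in}}$ is the prescribed density, $E_s[v]$ the noninteracting ground-state energy in the potential $v$, $\rho_v$ the corresponding density, and $\lambda$ a penalty strength.

```latex
% Noninteracting constrained search at fixed density
\[
T_s[\rho_{\mathrm{in}}]
  = \min_{\Phi \mapsto \rho_{\mathrm{in}}} \langle \Phi | \hat{T} | \Phi \rangle
\]

% Wu-Yang-type dual: the potential v acts as the exact multiplier
\[
W[v] = E_s[v] - \int v(\mathbf{r})\, \rho_{\mathrm{in}}(\mathbf{r})\, d\mathbf{r},
\qquad
\frac{\delta W}{\delta v(\mathbf{r})} = \rho_v(\mathbf{r}) - \rho_{\mathrm{in}}(\mathbf{r})
\]

% Zhao-Morrison-Parr-type quadratic (Coulomb-norm) penalty
\[
\min_{\Phi}\; \langle \Phi | \hat{T} | \Phi \rangle
  + \frac{\lambda}{2} \iint
    \frac{[\rho_\Phi(\mathbf{r}) - \rho_{\mathrm{in}}(\mathbf{r})]\,
          [\rho_\Phi(\mathbf{r}') - \rho_{\mathrm{in}}(\mathbf{r}')]}
         {|\mathbf{r} - \mathbf{r}'|}
    \, d\mathbf{r}\, d\mathbf{r}'
\]
```

Maximizing $W[v]$ makes the gradient vanish, which is exactly the density-reproduction condition (for a nondegenerate ground state), and $W$ is unchanged by adding a constant to $v$ when the particle number is fixed, which is the additive-constant ambiguity mentioned in the abstract.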

296. ❌ Elucidating the Synergetic Interplay between Average Intermolecular Coupling and Coupling Disorder in Short-Time Exciton Transfer

Authors: Siwei Wang, Guangming Liu, Hsing-Ta Chen Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

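Scoring rationale: The paper studies exciton transport dynamics in molecular aggregates and belongs to theoretical physics and computational chemistry, focusing on an analytical framework for short-time exciton transport. It does not involve large models, deep learning, artificial intelligence, or any machine learning technique; none of the keywords, which concern large-model principles, applications, or related methods, apply, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper develops an analytical framework for short-time exciton dynamics in a one-dimensional lattice and reveals how the synergy between the average intermolecular coupling and coupling disorder affects exciton transport.

Abstract translation (Chinese)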

分子聚集体中的激子输运是决定有机光电子器件与光捕获系统性能的基本过程。尽管多数理论研究侧重于长时程输运行为，但超快光谱学的最新进展使短时域行为成为焦点——在该时域内，激子运动在飞秒至皮秒时间尺度上仍保持弹道性。本研究针对同时存在点位能量（对角）无序与分子间耦合（非对角）涨落的一维晶格，建立了短时激子动力学的解析理论框架。利用倒空间分析方法，我们推导了局域激发与移动高斯初始条件下第一和第二空间矩的闭合表达式。解析与数值结果表明：长时程动力学受对角无序影响，而短时弹道扩张主要由非对角无序主导。关键的是，我们揭示了平均分子间耦合与非对角耦合无序强度之间的协同相互作用，证明二者对短时激子输运具有等效贡献。此外，我们将这一普适无序模型与宏观量子电动力学框架下的实际分子系统相结合，从而为表征和优化复杂介电介质中无序分子聚集体的超快能量流提供了理论基础。

Abstract

Exciton transport in molecular aggregates is a fundamental process governing the performance of organic optoelectronics and light-harvesting systems. While most theoretical studies have emphasized long-time transport behavior, recent advances in ultrafast spectroscopy have brought into focus the short-time regime, in which exciton motion remains ballistic on femtosecond-to-picosecond timescales. In this work, we develop an analytical framework for short-time exciton dynamics in a one-dimensional lattice subject to both on-site energetic (diagonal) disorder and intermolecular coupling (off-diagonal) fluctuations. Utilizing the reciprocal-space analysis, we derive closed-form expressions for the first and second spatial moments considering both localized excitation and moving Gaussian initial conditions. Our analytical and numerical results show that, while the long-time dynamics are influenced by diagonal disorder, the short-time ballistic expansion is governed primarily by off-diagonal disorder. Crucially, we reveal a synergistic interplay between the average intermolecular coupling and the off-diagonal coupling disorder strength, demonstrating that they contribute equivalently to short-time exciton transport. Moreover, we integrate this generic disorder model with a realistic molecular system within the framework of macroscopic quantum electrodynamics, thereby providing a theoretical foundation for characterizing and optimizing ultrafast energy flow of disordered molecular aggregates in complex dielectric media.

Keywords: exciton transport, molecular aggregates, short-time dynamics, intermolecular coupling, off-diagonal disorder, analytical framework, ultrafast spectroscopy, macroscopic quantum electrodynamics
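
A leading-order estimate gives some intuition for why the average coupling and the coupling disorder can enter short-time transport on an equal footing. The derivation below is a sketch under assumptions that are not taken from the paper: a nearest-neighbour tight-binding Hamiltonian with real couplings $J_n$, site energies $\varepsilon_n$, lattice spacing $a$, an excitation initially localized on site 0, and independent couplings with mean $\bar{J}$ and variance $\sigma_J^2$. It is not the paper's closed-form result.

```latex
% Assumed model Hamiltonian
\[
H = \sum_n \varepsilon_n\, |n\rangle\langle n|
  + \sum_n J_n \bigl( |n\rangle\langle n+1| + |n+1\rangle\langle n| \bigr)
\]

% First-order amplitudes on the neighbouring sites give the t^2 term
\[
\langle x^2(t) \rangle
  = \frac{a^2 t^2}{\hbar^2} \bigl( J_0^2 + J_{-1}^2 \bigr) + O(t^4)
\]

% Disorder average over independent couplings
\[
\overline{\langle x^2(t) \rangle}
  \approx \frac{2 a^2 t^2}{\hbar^2} \bigl( \bar{J}^2 + \sigma_J^2 \bigr)
\]
```

At this order the site energies $\varepsilon_n$ drop out, while $\bar{J}^2$ and $\sigma_J^2$ appear symmetrically, in line with the abstract's statement that the two contribute equivalently to short-time transport and that diagonal disorder matters mainly at longer times.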

297. ❌ Exact density-functional theory as parallel ensemble variational hierarchies: from Lieb’s formulation to Kohn-Sham theory

Authors: Nan Sheng Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

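Scoring rationale: The paper is foundational mathematical-physics research on exact density functional theory (DFT), focusing on reconstructing and clarifying the theoretical framework, in particular Lieb’s ensemble formulation, Kohn-Sham theory, and variational hierarchies. All scoring keywords concern large models, deep learning techniques, and their applications (such as LLMs, MoE, training methods, inference techniques, and AI agents), none of which this paper addresses. The paper belongs to theoretical physics and computational chemistry and has no direct connection to artificial intelligence, machine learning, or large-model techniques, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper reconstructs exact ground-state density functional theory by distinguishing the interacting and noninteracting ensemble variational hierarchies and clarifying their relation to the Kohn-Sham auxiliary construction, providing a clearer framework that gives a unified account of concepts such as fractional particle number and derivative discontinuity.

Abstract translation (Chinese)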

精确的基态密度泛函理论包含两个并行的变分结构，它们常被压缩为单一叙述：一个植根于Lieb系综表述的相互作用体系层级，以及一个植根于精确系综非相互作用理论的非相互作用体系层级。我们围绕这一并行结构重构了精确DFT，并将这两个精确框架与在共同可容许密度类上连接它们的Kohn-Sham辅助密度泛函构造区分开来。从这一视角看，Levy-Lieb约束搜索、Hohenberg-Kohn图像以及通常的纯态非相互作用或Kohn-Sham表述，都表现为在附加限制下的更狭义特例。同样的组织方式也将分数粒子数、分段线性、单边化学势、导数不连续性、分数轨道占据以及Janak型关系纳入统一的变分图景中。交换关联结构也从同一角度被重新审视，它表现为相互作用与非相互作用层级之间的接口量，而不仅仅是Kohn-Sham分解中的未知余项。其结果是对精确DFT的形式重组，澄清了在压缩阐述中常被模糊的区分，包括泛函定义域与可表示性类别、非相互作用支撑势结构与Kohn-Sham辅助构造、以及密度复现与谱解释之间的差异。

Abstract

Exact ground-state density-functional theory contains two parallel variational structures that are often compressed into a single narrative: an interacting hierarchy rooted in Lieb’s ensemble formulation and a noninteracting hierarchy rooted in exact ensemble noninteracting theory. We reconstruct exact DFT around this parallel structure and distinguish both exact frameworks from the Kohn-Sham auxiliary density-functional construction that links them on a common admissible density class. From this viewpoint, the Levy-Lieb constrained search, the Hohenberg-Kohn picture, and ordinary pure-state noninteracting or Kohn-Sham formulations appear as narrower specializations under additional restrictions. The same organization also places fractional particle number, piecewise linearity, one-sided chemical potentials, derivative discontinuity, fractional orbital occupations, and Janak-type relations within a single variational picture. Exchange-correlation structure is reconsidered from the same standpoint, where it appears as the interface quantity between the interacting and noninteracting hierarchies rather than merely as the unknown remainder of the Kohn-Sham decomposition. The result is a formal reorganization of exact DFT that clarifies distinctions often blurred in compressed expositions, including functional domain versus representability class, noninteracting supporting-potential structure versus Kohn-Sham auxiliary construction, and density reproduction versus spectral interpretation.

Keywords: density-functional theory, variational hierarchies, Lieb’s formulation, Kohn-Sham theory, ensemble noninteracting theory, exchange-correlation structure, fractional particle number, derivative discontinuity

298. ❌ Vectorial Imaging of the Photodissociation of 2-Bromobutane Oriented via Hexapolar State Selection

Authors: Masaaki Nakamura, Po-Yu Tsai, Shiun-Jr Yang, King-Chuen Lin, Toshio Kasai, Dock-Chil Che, Andrea Lombardi, Federico Palazzetti, Vincenzo Aquilanti Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

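Scoring rationale: This paper is an experimental physical-chemistry study of molecular photodissociation, covering the oriented photodissociation of 2-bromobutane, ion imaging, and vector-correlation analysis. It does not touch on large models, deep learning, artificial intelligence, or related techniques. Every scoring keyword concerns large-model technology and its applications, none of which relates to the paper's topic, so all keywords receive a relevance of 0.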
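The 26.6 pass mark in the score line follows from the formula in the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below reproduces that arithmetic and compares the summed score column against it. The function names are hypothetical, and treating "score at or above the pass mark" as passing is an assumption, since the report does not say whether the comparison is strict.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark as configured in the report: base + slope * total weight."""
    return base + slope * total_weight


def evaluate(keyword_scores, total_weight):
    """Sum the per-keyword scores and compare against the pass mark.

    Assumption: a paper passes when its total is >= the pass mark.
    """
    total = sum(keyword_scores)
    return total, total >= pass_mark(total_weight)


# 27 keywords, each with a score of 0.0, as in the table above.
threshold = pass_mark(27.0)
total, passed = evaluate([0.0] * 27, 27.0)
print(f"pass mark = {threshold:.1f}, total = {total:.1f}, passed = {passed}")
```

With every keyword scored 0.0, the total stays far below the pass mark, which matches the ❌ shown for this paper.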

!!! tip deepseek-chat TL;DR

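The paper studies the photodissociation of 2-bromobutane oriented by hexapolar state selection. It uses sliced ion imaging to analyze the correlation among three vectors (photofragment recoil velocity, transition dipole moment, and permanent dipole moment) and finds no significant difference between the photofragment angular distributions of the two enantiomers.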

Abstract

Molecular orientation techniques are becoming available in the study of elementary chemical processes, in order to highlight those structural and dynamical properties that would be concealed by random rotational motions. Recently successful orientation was achieved for asymmetric-top and chiral molecules of much larger complexity than hitherto. In this work, we report and discuss the correlation between the vectors photofragment recoil velocity v, transition dipole moment μ, and permanent dipole moment d in a dissociation experiment on hexapole oriented 2-bromobutane, photoinitiated by a linearly polarized laser. The sliced ion images of the Br* ($^2$P$_{1/2}$) and Br ($^2$P$_{3/2}$) photofragment were acquired at 234.0 and 254.1 nm, respectively, by (2+1) resonance-enhanced multiphoton ionization technique. A detailed analysis of the sliced ion images obtained at a tilting angle 45° of the laser polarization provides the information on correlation of the three vectors, which are confined by two polar angles α, χ and one azimuthal angle φ$_{\mu d}$ in the recoil frame. The sliced ion images of Br fragments eliminated individually from the enantiomers at 254.1 nm yield the asymmetric factor close to zero; for this reason the photofragment angular distributions do not show significant differences. The elimination of Br* fragment at 234.0 nm is mainly correlated with a parallel transition, giving rise to a large anisotropy parameter of 1.85, and thus can be considered as a single state excitation. The resulting recoil frame angles are optimized to 163.8° and 164.1° for α and χ, respectively, whereas φ$_{\mu d}$ approaches close to 0° for the best fit. Since in the present case, the three vectors have an only slight spatial arrangement, the photofragment angular distributions of the two enantiomers do not show appreciable differences…
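To see why an anisotropy parameter of 1.85 signals a predominantly parallel transition, it can help to evaluate the textbook one-photon angular distribution for randomly oriented molecules, I(θ) ∝ 1 + β·P₂(cos θ), where θ is the angle between the recoil velocity and the laser polarization and β ranges from −1 (purely perpendicular) to 2 (purely parallel). This is only a sketch of that standard expression, not the paper's analysis: the oriented-molecule experiment described above involves further vector-correlation terms that are not modeled here, and the function names are hypothetical.

```python
import math


def legendre_p2(x):
    """Second Legendre polynomial P2(x) = (3x^2 - 1) / 2."""
    return 0.5 * (3.0 * x * x - 1.0)


def angular_distribution(theta_deg, beta):
    """Normalized distribution I(theta) = [1 + beta * P2(cos theta)] / (4 * pi).

    Textbook form for unoriented molecules and one-photon dissociation;
    beta must lie in the physical range [-1, 2].
    """
    if not -1.0 <= beta <= 2.0:
        raise ValueError("beta must lie between -1 and 2")
    cos_theta = math.cos(math.radians(theta_deg))
    return (1.0 + beta * legendre_p2(cos_theta)) / (4.0 * math.pi)


beta = 1.85  # anisotropy parameter quoted in the abstract for Br* at 234.0 nm
ratio = angular_distribution(0.0, beta) / angular_distribution(90.0, beta)
print(f"I(0 deg) / I(90 deg) = {ratio:.1f}")
```

In this simplified picture, fragments recoil along the polarization axis far more often than perpendicular to it, which is the behavior expected near the parallel limit β = 2.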

Keywords: photodissociation, 2-bromobutane, hexapole orientation, sliced ion imaging, vector correlation, enantiomers, anisotropy parameter, resonance-enhanced multiphoton ionization

299. ❌ Doppler dual-comb coherent Raman spectromicroscopy

Authors: Florian M. Schweizer, Hannah Terrasa, Manish Garg Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

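Scoring rationale: This paper focuses on optical physics and chemical imaging, proposing a Doppler-based dual-comb coherent Raman spectromicroscopy technique. Its content lies entirely in experimental physics and instrument development, involving laser technology, nonlinear optics, and spectroscopic methods. None of the keywords about large language models, deep learning, or AI principles and applications apply, so every keyword except 'AI for Science OR Bioinformatics OR Cheminformatics' receives a relevance of 0. 'AI for Science' is rated 5/10 because the paper belongs to scientific research (chemical imaging) in a broad sense, although it uses no AI methods.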

!!! tip deepseek-chat TL;DR

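The paper proposes a Doppler-based dual-comb coherent Raman spectromicroscopy technique in which a single ultra-broadband laser source generates both frequency combs. This enables background-free, sensitive, and fast (millisecond acquisition) broadband chemical imaging. In imaging a poly-methyl-methacrylate bead, the authors report a roughly 2.5-fold improvement in diffraction-limited spatial resolution (~280 nm).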

Abstract

Chemical imaging enabled by Raman processes is crucial to investigating biological and chemical samples in a label-free manner. Stimulated Raman spectroscopy (SRS) overcomes the key limitation associated with low signal levels in spontaneous Raman spectroscopy, however, at the expense of probing only narrow Raman bands. Time-domain implementation of coherent anti-Stokes Raman spectroscopy (CARS) by dual frequency combs can achieve broad Raman bandwidths; nevertheless, its execution is demanding due to strenuous temporal-synchronization of two independent ultrashort laser sources. Here, we introduce time-domain coherent Raman spectroscopy utilizing two frequency combs generated by the Doppler effect from a single ultra-broadband laser source. In contrast to CARS, in our approach, the interference of impulsively launched vibrations by two broadband frequency combs (τ ~ 6 fs) periodically modulates the Kerr nonlinear response of the medium, leading to cross-phase modulation (XPM) experienced by both the combs. This phase modulation leads to spectral broadening and periodic modulation in the anti-Stokes region of the combs. Down-conversion by a factor of ~ 10$^{-8}$ in the frequency of the vibrations enabled by the dual-comb approach empowered us to use photon-counting methodology in the anti-Stokes region. This makes our technique extremely versatile, background-free, sensitive and fast (millisecond acquisition times), in probing a range of samples from wide bandgap dielectrics and liquids to individual micro-particles with nondestructive pulse energies (~ 100 pJ) incident on the sample. Owing to the higher-order nonlinearity involved in the XPM process, we achieved ~ 2.5 times improvement in diffraction-limited spatial resolution (~ 280 nm) in ultra-broadband chemical imaging of a ~ 8 μm bead of poly-methyl-methacrylate.
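The abstract's down-conversion factor of roughly 10⁻⁸ is easier to appreciate with numbers. The sketch below converts a Raman shift in wavenumbers to a vibrational frequency and scales it by that factor. The 1000 cm⁻¹ mode is a hypothetical example chosen purely for illustration (it is not a value from the paper), and the function names are likewise assumptions.

```python
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10  # exact SI value, expressed in cm/s


def wavenumber_to_hz(wavenumber_per_cm):
    """Convert a Raman shift in cm^-1 to a vibrational frequency in Hz."""
    return wavenumber_per_cm * SPEED_OF_LIGHT_CM_PER_S


def down_converted_hz(vibration_hz, factor=1e-8):
    """Scale a vibrational frequency by the dual-comb down-conversion factor.

    The default factor of 1e-8 is the order of magnitude quoted in the abstract.
    """
    return vibration_hz * factor


# Hypothetical 1000 cm^-1 vibrational mode (illustrative assumption).
vibration_hz = wavenumber_to_hz(1000.0)
detected_hz = down_converted_hz(vibration_hz)
print(f"vibration: {vibration_hz / 1e12:.1f} THz -> detected: {detected_hz / 1e3:.0f} kHz")
```

Under these assumptions, a vibration of about 30 THz would appear as a signal of a few hundred kilohertz.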

Keywords: coherent Raman spectroscopy, dual frequency combs, Doppler effect, chemical imaging, cross-phase modulation, photon-counting, diffraction-limited resolution, poly-methyl-methacrylate

300. ❌ Ab Initio Simulation of Femtosecond Time-Resolved Multi-Pulse Spectroscopies applied to the Heptazine$\cdots$H$_2$O Complex

Authors: Sebastian V. Pios, Maxim F. Gelin, Wolfgang Domcke, Lipeng Chen Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.22671v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

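Scoring rationale: This computational-chemistry paper concerns methods for simulating femtosecond time-resolved multi-pulse spectroscopies, applied to the photoinduced dynamics of the heptazine-water complex. It deals exclusively with molecular spectroscopy, quantum-chemical calculations, and the simulation of spectroscopic experiments, and involves no large language models, deep learning, AI principles, or AI applications in science. Keywords 1-26 are unrelated to the topic and receive a relevance of 0. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', but the paper uses conventional quantum-chemical methods (ab initio on-the-fly) rather than AI or machine learning, so it is rated only 5/10 (some relevance), on the grounds that computational chemistry belongs to scientific computing in a broad sense.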

!!! tip deepseek-chat TL;DR

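The paper generalizes the quasi-classical doorway-window method to multi-pulse spectroscopies, develops computationally efficient simulation protocols, and applies them to the heptazine-water complex. It shows that pump-stimulated experiments (PPP and P-2D) provide richer information on ultrafast radiationless relaxation dynamics than conventional PP and 2D experiments.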

Abstract

In multi-dimensional time-resolved spectroscopic experiments, multiple (more than two) short laser pulses with variable pulse delay times are employed for the time-resolved exploration of the photoinduced dynamics of molecular chromophores. In the present work, the quasi-classical doorway-window (DW) methodology recently developed for transient absorption pump-probe (PP) spectroscopy [M. F. Gelin et al., J. Chem. Theory Comput. 2021, 17, 2394] has been generalized to multi-pulse spectroscopies. Pump-push-probe (PPP) spectroscopy (involving three laser pulses) and pump-induced two-dimensional (P-2D) spectroscopy (involving five laser pulses) are considered as specific examples. The quasi-classical DW approximation results in conceptually simple and computationally efficient simulation protocols which are suitable for implementation with $ab$ $initio$ on-the-fly electronic-structure calculations. Simulations of PPP and P-2D spectra performed for the hydrogen-bonded heptazine$\cdots$H$_2$O complex illustrate that pump-stimulated experiments provide much richer information on the ultrafast radiationless relaxation dynamics of the excited electronic states of the heptazine$\cdots$H$_2$O complex than conventional PP and 2D experiments.

Keywords: time-resolved spectroscopy, multi-pulse spectroscopy, quasi-classical doorway-window method, ab initio simulation, heptazine-water complex, ultrafast dynamics, pump-push-probe, pump-induced 2D spectroscopy

301. ❌ Radial Gausslets

Authors: Steven R. White Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

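Scoring rationale: The paper constructs a radial gausslet basis for electronic-structure calculations, which falls under computational chemistry and physics and is unrelated to all keywords about large models, deep learning, and AI principles. It has only a weak connection to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the work belongs to scientific computing. The paper itself uses conventional numerical methods rather than AI, so that keyword is rated 5/10 (some relevance).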

!!! tip deepseek-chat TL;DR

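The paper proposes a radial gausslet construction for electronic-structure calculations on atomic systems, removing the restriction of earlier gausslets to Cartesian coordinates, and illustrates its accuracy with Hartree-Fock and exact diagonalization.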

Abstract

Gausslets are one of the few examples of basis sets for electronic structure which allow for two-index/diagonal electron-electron interaction terms. A weakness of gausslets is that, because of their 1D origin, they have been tied to Cartesian coordinates. Here we generalize the gausslet construction for the radial coordinate in three dimensions for atomic basis sets. These radial gausslets make a very compact radial basis with a relatively modest number of functions, with diagonal interaction terms. We illustrate the accuracy of this construction with Hartree–Fock and exact diagonalization on atomic systems.

Keywords: Gausslets, radial basis, electronic structure, atomic systems, diagonal interaction terms, Hartree-Fock, exact diagonalization, basis sets

302. ❌ Influence Functional Approach to Non-Perturbative Exciton Binding Renormalization from Phonons

Authors: Rohit Rana, Eric R. Heller, Antonios M. Alvertis, Jeffrey B. Neaton, David T. Limmer Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

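Scoring rationale: The paper studies the renormalization of exciton binding energies in condensed-matter physics, using first-principles calculations, the GW approximation, density functional perturbation theory, and path integral Monte Carlo. It is unrelated to the vast majority of keywords (large-model technology, training methods, inference optimization, alignment, agent systems, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the work belongs to computational physics and materials science and can be read as 'AI for Science' in a broad sense, although the paper does not explicitly mention AI, so that keyword is rated 5/10 (some relevance).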

!!! tip deepseek-chat TL;DR

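The paper studies how phonons renormalize exciton binding energies as a function of temperature. Using a Hamiltonian parameterized from first principles and non-perturbative path integral Monte Carlo calculations, it finds that exciton binding energies are appreciably renormalized only by coupling to optical phonons, with binding energies in quantitative agreement with experiment.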

Abstract

We construct a many-body model Hamiltonian to capture how phonons renormalize exciton binding as a function of temperature. By using the GW approximation and density functional perturbation theory, we are able to parameterize this Hamiltonian completely from first principles. To capture static quasiparticle properties non-perturbatively, we evolve this Hamiltonian in imaginary time with path integral Monte Carlo using an influence functional based approach. For a class of Wannier-Mott type excitons, our binding energies are in quantitative agreement with experiment. We find that in addition to long-range dipolar interactions from longitudinal optical modes, short-ranged deformation potentials from acoustic modes and transverse optical modes can significantly renormalize electron and hole polaron binding energies at elevated temperature. However, exciton binding energies are only appreciably renormalized by coupling to optical phonons.

Keywords: exciton binding, phonon renormalization, GW approximation, density functional perturbation theory, path integral Monte Carlo, influence functional, Wannier-Mott excitons, polaron binding energies

303. ❌ A Density-Based Continuous Local Symmetry Measure

Authors: Duc Anh Lai, Devin A. Matthews Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

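Scoring rationale: The paper studies local symmetry measures in chemistry, which falls under computational chemistry and is unrelated to deep learning or large-model technology. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves molecular structure analysis of the kind found in cheminformatics. It mentions no AI methods and relies on conventional computational chemistry, so that keyword is rated 5/10 (some relevance). All other keywords concern large models and deep learning, have no overlap with the paper, and receive a relevance of 0.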

!!! tip deepseek-chat TL;DR

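Noting that local symmetry remains under-investigated in chemistry, the paper proposes a new framework for measuring local symmetry based on electron density localization and demonstrates its value for molecular structure analysis.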

Abstract

Although continuous symmetry theory has attracted increasing attention in modern chemistry, local symmetry remains under-investigated. As a consequence, the relationship between symmetry and chemical behavior is often obscured, limiting the practical use of fuzzy symmetry measures. In this study, we introduce a novel framework for evaluating local symmetry based on electron density localization, and present continuous symmetry representations for several representative molecules. Our approach not only quantitatively captures global symmetry, but also reveals distinctive features of symmetry in a local chemical environment. The related concept, local chirality or chirotopicity, is also discussed. Overall, the proposed local symmetry and chirality measures provide valuable insights into molecular structure and structure-property relationships.

Keywords: continuous symmetry, local symmetry, electron density, molecular structure, chirality, chemical behavior, symmetry measure, density-based

304. ❌ Molecular dynamics study of perchloric acid using the extended Madrid-2019 force field

Authors: M. Cruz-Sánchez, S. Blazquez, C. Vega, V. M. Trejos Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22521v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

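Scoring rationale: This is a purely computational-chemistry study that uses the extended Madrid-2019 force field for molecular dynamics simulations of perchloric acid, covering force-field development, prediction of thermodynamic properties, and structural analysis. It is unrelated to the vast majority of keywords (large models, deep learning, training methods, inference optimization, AI agents, and so on), which belong to AI and machine learning, whereas the paper belongs to conventional computational chemistry and molecular simulation. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because computational chemistry is adjacent to cheminformatics. The paper uses no AI or ML methods, only conventional molecular simulation, so that keyword is rated 5/10 (some relevance, but not central).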

!!! tip deepseek-chat TL;DR

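The study models perchloric acid solutions with the extended Madrid-2019 force field and predicts densities, structural features, and transport properties at several concentrations. Predicted densities agree with experiment up to 10 $m$, and predicted viscosities agree for concentrations below 4 $m$.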

Abstract

Perchloric acid (HClO$_4$) is widely used to prepare perchlorate salts with applications in propellants, industry, environmental chemistry, and biology. In this work, we used the intermolecular parameters from the extended Madrid-2019 force field for the perchlorate anion (ClO$_4^-$) and the oxonium cation (H$_3$O$^+$) together with TIP4P/2005 water to model perchloric acid solutions. The force field uses scaled charges of $\pm0.85e$ for monovalent ions and has been widely applied for aqueous ionic systems. We used the model to predict thermodynamic properties [densities and temperatures of maximum in density (TMD)], structural features (ion-water correlations: ion-hydrogen and ion-oxygen), and transport properties (self-diffusion coefficients and viscosity) of perchloric acid solutions at several concentrations. Experimental densities are predicted in excellent agreement up to 10 $m$. We also performed molecular simulations over a wide range of temperatures in order to determine the TMD of perchloric acid at different molalities. Predicted viscosities at 298.15 K and 1 bar are in good agreement with experimental data for concentrations below 4 $m$. Results are discussed in terms of model strengths and limitations.

Keywords: molecular dynamics, perchloric acid, force field, thermodynamic properties, structural features, transport properties, simulation, aqueous solutions

Token usage statistics

Total: 887,634 tokens (input 565,648 / output 321,986)

Model	Input	Output	Total
deepseek-chat	542,471	299,522	841,993
glm-4.7	23,177	22,464	45,641
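As a quick sanity check on the token table, the per-model rows can be summed and compared against the reported totals. The sketch below hard-codes the figures from the table; the dictionary layout and function name are illustrative assumptions and not part of the report's tooling.

```python
# Rows copied from the token usage table: model -> (input, output, total).
usage = {
    "deepseek-chat": (542_471, 299_522, 841_993),
    "glm-4.7": (23_177, 22_464, 45_641),
}


def column_totals(rows):
    """Sum the input, output, and total columns across all models."""
    inputs = sum(row[0] for row in rows.values())
    outputs = sum(row[1] for row in rows.values())
    totals = sum(row[2] for row in rows.values())
    return inputs, outputs, totals


inputs, outputs, totals = column_totals(usage)

# Each row's total should equal its own input plus output.
rows_consistent = all(i + o == t for i, o, t in usage.values())

print(f"input={inputs:,} output={outputs:,} total={totals:,} rows_consistent={rows_consistent}")
```

The column sums reproduce the reported 565,648 input, 321,986 output, and 887,634 total tokens.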

⭐ Detailed Analysis of Passing Papers

1. SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

2. Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

3. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

4. Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-B

5. MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage

6. Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts

7. Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

📋 All Papers

1. ✅ SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

2. ✅ Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

3. ✅ Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

4. ✅ Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics

5. ✅ MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage

6. ✅ Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts

7. ✅ Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

8. ❌ SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

9. ❌ ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

10. ❌ Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

11. ❌ Leveraging LLMs and Social Media to Understand User Perception of Smartphone-Based Earthquake Early Warnings

12. ❌ Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length

13. ❌ TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

14. ❌ Avoiding Over-smoothing in Social Media Rumor Detection with Pre-trained Propagation Tree Transformer

15. ❌ Permutation-Symmetrized Diffusion for Unconditional Molecular Generation

16. ❌ PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

17. ❌ Failure of contextual invariance in gender inference with large language models

18. ❌ VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

19. ❌ ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

20. ❌ VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

21. ❌ InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

22. ❌ 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

23. ❌ Code Review Agent Benchmark

24. ❌ Evaluating LLM-Based Test Generation Under Software Evolution

25. ❌ Targeted Adversarial Traffic Generation: Black-box Approach to Evade Intrusion Detection Systems in IoT Networks

26. ❌ Mecha-nudges for Machines

27. ❌ Bilevel Autoresearch: Meta-Autoresearching Itself

28. ❌ Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback

29. ❌ SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

30. ❌ Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies

31. ❌ Planning over MAPF Agent Dependencies via Multi-Dependency PIBT

32. ❌ Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

33. ❌ Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions

34. ❌ Contrastive Metric Learning for Point Cloud Segmentation in Highly Granular Detectors

35. ❌ Edge Radar Material Classification Under Geometry Shifts

36. ❌ RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue

37. ❌ WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention

38. ❌ Unilateral Relationship Revision Power in Human-AI Companion Interaction

39. ❌ LLM Olympiad: Why Model Evaluation Needs a Sealed Exam

40. ❌ Designing Agentic AI-Based Screening for Portfolio Investment

41. ❌ Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression

42. ❌ A Comparative Study of Machine Learning Models for Hourly Forecasting of Air Temperature and Relative Humidity

43. ❌ Emergence of Fragility in LLM-based Social Networks: the Case of Moltbook

44. ❌ A Multimodal Framework for Human-Multi-Agent Interaction

45. ❌ Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

46. ❌ AI Lifecycle-Aware Feasibility Framework for Split-RIC Orchestration in NTN O-RAN

47. ❌ SafeSeek: Universal Attribution of Safety Circuits in Language Models

48. ❌ Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

49. ❌ Online library learning in human visual puzzle solving

50. ❌ A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

51. ❌ MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation

52. ❌ General Machine Learning: Theory for Learning Under Variable Regimes

53. ❌ PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

54. ❌ ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

55. ❌ Reasoning over Semantic IDs Enhances Generative Recommendation

56. ❌ SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense

57. ❌ Robust Safety Monitoring of Language Models via Activation Watermarking

58. ❌ Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

59. ❌ Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy